#techtalk A journey into collecting Chinese air pollution data
(Author: Bo Li)
Scraping data off the Internet is always an interesting adventure in strange formats, broken HTML, or odd API implementations. Our service requires outdoor pollution measurements. Unfortunately, many territories publish this data in a variety of formats and protocols, and very rarely through a clean API.
We've been collecting data from all of the public air quality stations in China for a while now but, last week, an update to the CNEMC (China National Environmental Monitoring Centre) website broke our data collection. This led us into a bit of reverse engineering on a new data format that was not so straightforward to guess.
Let me share in detail how we managed to access data from CNEMC in the first place and how we managed to figure out the nature of the breaking change.
It begins with our OpenKongqi service
OpenKongqi is our open-source outdoor air quality data service, which aims to provide customers with outdoor air quality data from the station closest to their location.
For China, we were initially getting data from various sources: provincial websites, PM25.in (a real-time PM2.5 and Air Quality Index (AQI) lookup site), and http://86pm25.com. However, those data were incomplete, regularly unavailable, and limited to major cities. As our customer base spread geographically across China, we needed to expand OpenKongqi's coverage as well.
With some effort, we got data from CNEMC
The China National Environmental Monitoring Centre (CNEMC) has an online air quality website that makes the country’s data available. It would be a good starting point for scraping, were it not for Microsoft Silverlight, which makes the website accessible only with Internet Explorer on Windows. Scraping it seemed impossible at first, since Silverlight (Microsoft’s attempt to rival Adobe Flash) is an entirely binary solution with no easy way to inspect what’s going on under the hood.
After a bit of research online, we discovered that openAQ, an open-source worldwide public air quality data collection service, was using code from a GitHub repository called ChinaAQIData (geoinsights/ChinaAQIData, "AQI Data of Cities in China") to collect data in China.
We dug into that project and realized that it used a Python package, python-wcfbin (ernw/python-wcfbin on GitHub, a Python library for converting between WCF binary XML and plain XML), to convert the WCF binaries on the CNEMC website (Windows Communication Foundation binary files, used for data communication in Silverlight) to standard XML.
With that new knowledge, it became possible to read the data stream from the Silverlight service and integrate it into OpenKongqi.
Voilà, our OpenKongqi service expanded geographically!
Here comes the breaking change
In early August, CNEMC made some updates and our scraping stopped working. A few changes were easy to spot, such as a new domain name and new endpoints. The hardest part to figure out was the new data format, which looked binary.
It was time to inspect the traffic between the Silverlight app and the server. Based on our experience with CNEMC, we expected a single file to contain the data for the whole country, so the WCF binary with the largest size was the first thing to look at.
After clicking around the website while capturing its traffic on Fiddler 4, we found the first candidate:
“GetAllAQIPublishLive” is fairly sizable and even has a filename that shows a lot of promise. Now let’s open it in the WCF Binary inspector:
The only useful information found in the WCF is a long ASCII string that doesn’t look like much. Because of the “=” padding at the end of the string, we suspected some form of web encoding, probably Base64. After decoding it, we ended up with a seemingly unreadable binary string. A few more attempts at reading it with different character encodings (UTF-8, GB2312, etc.) did not help.
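The two checks described here can be sketched in a few lines of Python. This is only an illustration, not our production code: the helper names looks_like_base64 and try_text_decodings are made up for this example, and the list of encodings is just the two mentioned above.

```python
import base64
import binascii


def looks_like_base64(text: str) -> bool:
    """Return True if `text` decodes cleanly as standard Base64."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def try_text_decodings(raw: bytes, encodings=("utf-8", "gb2312")) -> dict:
    """Map each encoding that decodes `raw` without error to the decoded text."""
    readable = {}
    for name in encodings:
        try:
            readable[name] = raw.decode(name)
        except UnicodeDecodeError:
            # Not valid text in this encoding; move on to the next one
            pass
    return readable
```

An empty result from try_text_decodings is a hint that the bytes are not plain text at all, which is what pushed us to look at the payload as a file format instead.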
Our next hunch was that the data might be in some known file format. Looking at the binary string in hex, we took the first two bytes and treated them as a magic number, or file signature.
The first two bytes are “78 9c”. Looking this up online, we found that it is the header of a zlib stream compressed at the default level with no preset dictionary. That meant we could probably just decompress it to get the original data.
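For the curious, here is a small sketch of how those two bytes can be read according to the zlib header layout (RFC 1950). The function name describe_zlib_header is hypothetical and the code is illustrative only; in practice, looking the signature up online was enough for us.

```python
def describe_zlib_header(data: bytes) -> dict:
    """Interpret the first two bytes of `data` as a zlib (RFC 1950) header."""
    if len(data) < 2:
        raise ValueError("need at least two bytes")
    cmf, flg = data[0], data[1]
    method = cmf & 0x0F   # 8 means DEFLATE
    cinfo = cmf >> 4      # log2(window size) - 8, at most 7
    return {
        # The two bytes, read as a big-endian number, must be a multiple of 31
        "is_zlib": method == 8 and cinfo <= 7 and ((cmf << 8) | flg) % 31 == 0,
        "window_size": 1 << (cinfo + 8),
        "preset_dictionary": bool(flg & 0x20),
        # 0 = fastest, 1 = fast, 2 = default, 3 = maximum compression
        "compression_level": flg >> 6,
    }


header = describe_zlib_header(bytes.fromhex("789c"))
```

For “78 9c”, this reports a valid zlib header with a 32 KiB window, no preset dictionary, and the default compression level.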
Now, let’s try it again: first Base64-decode the string, and then zlib-decompress it.
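In Python, the two steps map directly onto the standard library. The sketch below is a minimal version under a few assumptions: the function name decode_payload is ours for this example, the decompressed XML is assumed to be UTF-8 text, and the sample payload is built locally for the demonstration rather than taken from CNEMC.

```python
import base64
import zlib


def decode_payload(encoded: str, text_encoding: str = "utf-8") -> str:
    """Base64-decode `encoded`, zlib-decompress the result, return it as text."""
    compressed = base64.b64decode(encoded)
    xml_bytes = zlib.decompress(compressed)
    # Assumption: the XML is UTF-8; pass another encoding if it is not
    return xml_bytes.decode(text_encoding)


# Round trip on a made-up document, standing in for the real string
sample_xml = "<AQI><City>北京</City></AQI>"
sample_payload = base64.b64encode(zlib.compress(sample_xml.encode("utf-8"))).decode("ascii")
decoded = decode_payload(sample_payload)
```

Here decoded is identical to sample_xml, which confirms the helper reverses a compress-then-encode step.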
Bingo! We got exactly what we were looking for: a standard XML document containing real-time air quality data from public stations all around China. Now, all that’s left to do is to modify our data collection service to follow the same process. With China’s public outdoor data back on the gams platform, we now reach more than 7000 stations around the world.
No matter where you live, we empower you with knowledge about the air you breathe.
This article shows just one example of the daily challenges we encounter at gams. We are a team that constantly seeks new solutions and pushes itself to a higher level each day.