Overview
Teaching: 15 min Exercises: 0 min
Questions
FIXME
Objectives
Write Python programs to download data sets using simple REST APIs.
A growing number of organizations make data sets available on the web in a style called REST, which stands for REpresentational State Transfer. The details (and ideology) aren’t important; what matters is that when REST is used, every data set is identified by a URL and can be accessed through a set of functions called an application programming interface (API).
For example, the World Bank’s Climate Data API provides data generated by 15 global circulation models. According to the API’s home page, the data sets containing yearly averages for various values are identified by URLs of the form:
http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/var/year/iso3.ext
where:
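- var is either pr (for precipitation) or tas (for “temperature at surface”);
- iso3 is the three-letter ISO code for a country, such as CAN for Canada; and
- ext is the file extension that selects the data format, such as csv for comma-separated values.
For example, if we want the average annual temperature in Canada as a CSV file, the URL is: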
http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv
If we paste that URL into a browser, it displays:
year,data
1901,-7.67241907119751
1902,-7.862711429595947
1903,-7.910782814025879
...
2007,-6.819293975830078
2008,-7.2008957862854
2009,-6.997011661529541
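The URL pattern described above can be filled in by a short function. This is a minimal sketch: the names URL_TEMPLATE and build_url are hypothetical, chosen for illustration, and are not part of the World Bank API.

```python
# Template for yearly-average data sets, copied from the URL form shown earlier.
# URL_TEMPLATE and build_url are illustrative names, not part of the API.
URL_TEMPLATE = ('http://climatedataapi.worldbank.org/climateweb/rest/v1/'
                'country/cru/{var}/year/{iso3}.{ext}')


def build_url(var, iso3, ext='csv'):
    """Return the URL for one variable, one country, and one format."""
    return URL_TEMPLATE.format(var=var, iso3=iso3, ext=ext)


# Reproduces the URL for average annual temperature in Canada.
print(build_url('tas', 'CAN'))
```

Keeping the pattern in one place means that a change to the URL scheme only has to be made once.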
Behind the Scenes
This particular data set might be stored in a file on the World Bank’s server, or that server might:
- Receive our URL.
- Break it into pieces.
- Extract the three key fields (the variable, the country code, and the desired format).
- Fetch the desired data from a database.
- Format the data as CSV.
- Send that to our browser.
As long as the World Bank doesn’t change its URLs, we don’t need to know which method it’s using and it can switch back and forth between them without breaking our programs.
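The "break it into pieces" and "extract the three key fields" steps might look roughly like the following. This is only a sketch of the idea, using the urllib.parse module from Python 3's standard library. We don't know how the World Bank's server is actually implemented, and split_request is a hypothetical name.

```python
from urllib.parse import urlparse


def split_request(url):
    """Extract (variable, country code, format) from a yearly-average URL.

    Hypothetical helper: assumes the path ends in .../<var>/year/<iso3>.<ext>.
    """
    pieces = urlparse(url).path.split('/')
    var = pieces[-3]                            # third piece from the end, e.g. 'tas'
    iso3, _, ext = pieces[-1].rpartition('.')   # 'CAN.csv' becomes 'CAN' and 'csv'
    return var, iso3, ext


url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'
print(split_request(url))  # prints ('tas', 'CAN', 'csv')
```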
If we only wanted to look at data for a couple of countries, we could just download those files one by one. But we want to compare data for many different pairs of countries, so we should write a program.
Python has a library called urllib2 (its functionality lives in urllib.request in Python 3) for working with URLs. It is clumsy to use, though, so many people (including us) prefer a newer library called Requests. To install it, run the command:
$ pip install requests
Installing with Pip
Note that pip is a program in its own right, so the command above must be run in the shell, and not from within Python itself.
If Requests is not already installed, pip’s output is:
Downloading/unpacking requests
Downloading requests-2.7.0-py2.py3-none-any.whl (470kB): 470kB downloaded
Installing collected packages: requests
Successfully installed requests
Cleaning up...
If it’s already present, the output will be:
Requirement already satisfied (use --upgrade to upgrade): requests in /Users/swc/anaconda/lib/python2.7/site-packages
Cleaning up...
Either way, we can now get the data we want like this:
import requests
url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'
response = requests.get(url)
if response.status_code != 200:
    print('Failed to get data:', response.status_code)
else:
    print('First 100 characters of data are')
    print(response.text[:100])
First 100 characters of data are
year,data
1901,-7.67241907119751
1902,-7.862711429595947
1903,-7.910782814025879
1904,-8.15572929382
The first line imports the requests library. The second defines the URL for the data we want; we could just pass this URL as an argument to the requests.get call on the third line, but assigning it to a variable makes it easier to find. requests.get actually gets our data. More specifically, it:
- creates a connection to the climatedataapi.worldbank.org server;
- sends the path /climateweb/rest/v1/country/cru/tas/year/CAN.csv to that server;
- sets the response’s status_code member variable to tell us whether the request succeeded or not; and
- stores the data the server sent back in the response’s text member variable.
The server can return many different status codes; the most common are:
Code | Name | Meaning |
---|---|---|
200 | OK | The request has succeeded. |
204 | No Content | The server has completed the request, but doesn’t need to return any data. |
400 | Bad Request | The request is badly formatted. |
401 | Unauthorized | The request requires authentication. |
404 | Not Found | The requested resource could not be found. |
408 | Request Timeout | The server gave up waiting for the client. |
418 | I’m a teapot | No, really… |
500 | Internal Server Error | An error occurred in the server. |
200 (OK) is the one we want; if we get anything else, the response probably doesn’t contain actual data (though it might contain an error message).
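To make failure messages easier to read, a program can translate a numeric code into its standard name. The sketch below uses the http.HTTPStatus enumeration from Python 3's standard library. describe_status is a hypothetical helper name, and codes the enumeration doesn't know are simply reported as unrecognized.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a string such as '404 Not Found' for a numeric status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one that the standard library knows about.
        return '{} (unrecognized status code)'.format(code)
    return '{} {}'.format(code, status.phrase)


print(describe_status(200))  # prints 200 OK
print(describe_status(404))  # prints 404 Not Found
```

In the program above, the failure branch could print describe_status(response.status_code) instead of the bare number.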
Some People Don’t Follow the Rules
Unfortunately, some sites don’t return a meaningful status code. Instead, they return 200 for everything, then put an error message (if appropriate) in the text of the response. This works when the result is being displayed to a human being, but fails miserably when the “reader” is a program that can’t actually read.
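One defensive habit is to check that the body looks like the data we expect before using it. The sketch below assumes the response should be CSV whose first line is the header year,data, as in the Canadian example. parse_yearly_data is a hypothetical name, and a real check would depend on the particular API.

```python
import csv
import io


def parse_yearly_data(text, expected_header=('year', 'data')):
    """Turn CSV text into a list of (year, value) pairs.

    Raises ValueError if the first line is not the expected header, which
    hints that the server sent an error message instead of data.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or tuple(rows[0]) != tuple(expected_header):
        raise ValueError('response does not look like yearly CSV data')
    return [(int(year), float(value)) for year, value in rows[1:]]


# The first two records from the Canadian data set shown earlier.
sample = 'year,data\n1901,-7.67241907119751\n1902,-7.862711429595947\n'
print(parse_yearly_data(sample))
```

Passing response.text to a function like this turns a misleading 200 response into an explicit error instead of silently bad data.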
Defining REST API
A REST API is:
1. A data format.
2. A way of accessing data via a URL.
3. Less work for the server.
4. Only accessible via Python libraries like Requests.
Get Data for Guatemala
Modify the little program above to fetch temperatures for Guatemala.
How Hot is Afghanistan?
Read the documentation for the Climate Data API, and then write URLs to find the annual average temperature for Afghanistan between 1980 and 1999.
Key Points
FIXME