
External Data: Acquire Data using APIs

Data can come from internal or external sources. In many cases, companies need access to data that is not provided by the ongoing operation of the business. For example, forestry businesses may rely on products and applications developed using regional timber prices, historical weather patterns, location-based services, or market indices.

In our modern connected world, data acquisition is an important business function, and most of us are comfortable locating data on the Internet. But what does that process look like, and how efficient is it? This post will look at one way that companies can acquire and interact with secondary data in a programmatic way through Application Programming Interfaces (APIs).

The first half of this post will provide a brief introduction to APIs and the basics of how they work.

The last half of this post will demonstrate a technical example of accessing data from an API. This part is intended for developers and/or analysts with a technical background, but non-technical people may also benefit from seeing how this data can be useful to an organization.

What is an API?

Simply put, an API provides a convenient way for two applications to communicate with one another. Modern software is designed for connectivity, and Web APIs provide the interface for an application (aka “frontend” or “client”) to request content from a service (aka “backend” or “server”). I like to think of APIs as a window to a “library of data” hosted by another organization, with guidelines on what data is available and how to access it.

Much of the modern internet runs on APIs. For instance, when users interact with mobile apps, there is a good chance that an API is providing content such as maps, travel details or weather information. Developers can use these APIs to build rich, interactive content into their applications, but for the Data Analyst, these APIs can also provide raw datasets that can be further processed and analyzed.

Companies, organizations and agencies create public and private APIs to entice developers to build solutions using their data. Organizations can use APIs to monetize their data by providing a tiered service of free and subscription access for users. API documentation is usually located under the Developer section of the organization’s website, but can also be found quickly using a Google search. For a compiled list of free public APIs, see this link to GitHub Public APIs.

If you are new to APIs or if you just need a refresher, Zapier provides an excellent resource that explains APIs in an easy to understand format.

How do APIs work?

When a client wants data provided by an API, the client packages the request and any associated parameters into the appropriate HTTP request method (GET, for example, when retrieving data). The request may carry authentication details (username, password, or API key), query parameters, and headers. The client software is typically a web browser, but it could be any application that is capable of making HTTP requests.

The server receives the request, authenticates the user, and returns an HTTP response to the client. In the case of some public APIs, the authentication stage may not be required. Responses are most often returned in a text-based format such as JSON (most popular) or XML.
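The request/response cycle described above can be sketched without touching the network. This example uses only the Python standard library rather than a third-party HTTP client. The service URL, the key, the query parameter, and the sample JSON body are all made up for illustration and do not belong to any real service.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical service URL and credentials, for illustration only.
BASE_URL = "https://api.example.com/v1/books"
API_KEY = "my-secret-key"


def build_request(url, api_key, params):
    """Package query parameters and an API key into an HTTP GET request."""
    query = urlencode(params)  # turns a dict into 'key=value&key=value'
    return Request(f"{url}?{query}",
                   headers={"X-Api-Key": api_key},
                   method="GET")


def parse_response(body):
    """Decode a JSON response body (text) into Python objects."""
    return json.loads(body)


# The request is only built here, never sent.
request = build_request(BASE_URL, API_KEY, {"author": "Muir"})

# A made-up JSON body standing in for what a server might send back.
sample_body = '{"total": 1, "data": [{"title": "Sample Title"}]}'
result = parse_response(sample_body)

print(request.get_method(), request.full_url)
print(result["data"][0]["title"])
```

The client side boils down to three pieces (method, URL with query string, headers), and the server's reply is plain text that becomes ordinary dictionaries and lists once decoded.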

The rest of the post will show a demonstration of how to access data from an API provided by the National Park Service (NPS) here in the U.S.

An API Demonstration

First I should note that there are very few public or private APIs available for forestry purposes. (Something I believe needs to change by the way, but I digress.) The only public forestry API that I currently know of is provided by the U.S. Forest Service’s Forest Inventory and Analysis Program, available at this website – FIADB API. The U.S. Department of Agriculture also provides a public API for the Recreation Information Database located at – Recreation.gov. If you know of others, I’d love to hear about them in an email, or better yet, in a comment on this post.

The data problem we will attempt is to locate and extract data about campgrounds that are maintained by the NPS. We want a list of all campgrounds in National Parks in a given State, and then we want to extract the features that are of particular interest. For this purpose our resulting feature set will be the name of the campground, its latitude and longitude coordinates, and the total number of campsites available. Finally, we want to convert the list to a Pandas DataFrame so that we can further process the dataset.

You may be thinking at this point, “I’m not seeing the benefit. Can’t I just find and download this information from their website?”. The simple answer is “Sure”, you can download a GIS layer for each Park’s campsites in the NPS Open-data ArcGIS Web portal located at the following site – NPS Open Data. The drawback with this approach is you have to manually download each dataset, then combine them into a single set, which is a tedious time-consuming process. With one API call you can get a dataset of all the campsites for a particular State. Without further ado, let’s see how this is done.

PART I

To access the campground data, an HTTP GET request is sent to the API endpoint. An endpoint is just the resource location (i.e., URL) that we are pointing to for the request. In our case the endpoint is /campgrounds, which is appended to the Base URL for this API.

[ Base URL: developer.nps.gov/api/v1 ]
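As a small illustration of how the endpoint and the Base URL fit together, here is a sketch that builds the full request URL. The https:// scheme and the query parameter names stateCode and limit are assumptions on my part, so check them against the NPS documentation. The helper build_url is a hypothetical name, not part of any library.

```python
from urllib.parse import urlencode

# Base URL as given in the post; the https:// scheme is an assumption.
BASE_URL = "https://developer.nps.gov/api/v1"


def build_url(endpoint, params=None):
    """Join the base URL and an endpoint, adding an optional query string."""
    url = f"{BASE_URL}/{endpoint.strip('/')}"
    if params:
        url = f"{url}?{urlencode(params)}"
    return url


# 'stateCode' and 'limit' are assumed parameter names; verify in the NPS docs.
plain_url = build_url("/campgrounds")
texas_url = build_url("/campgrounds", {"stateCode": "TX", "limit": 50})

print(plain_url)
print(texas_url)
```

When using the Python Requests library, the query string does not need to be built by hand: the plain endpoint URL and a params dictionary can be passed separately, which is what the function in this post does.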

This API endpoint can be accessed using the curl Linux command line tool, the Python Requests library, Postman, or a similar tool. We won’t go into detail about making HTTP requests here, but know that I will be using the Python Requests library for this demonstration.

The NPS API is freely available to the public, but it requires authentication. Users need to sign up for the NPS API service, after which a unique API key is provided. The sign-up form is available at the NPS Get Started page. This key is our pass to use the service. The NPS documentation describes several ways to pass the key to the API, using either an HTTP header or a query string. Since I’ve already registered for an API key, I will pass the key, stored in the variable api_key, as a request header (the headers dictionary inside the function below). Here is the custom Python function that retrieves the API data.

import requests
from requests.exceptions import HTTPError

def apiGET(url, api_key, params=None):
    '''retrieve api data using HTTP GET method on the
    /campgrounds endpoint'''
    # pass the api key in the headers
    headers = {'X-Api-Key': api_key}
    try:
        response = requests.get(url,
            headers=headers, params=params)
        # raise HTTPError for 4xx/5xx status codes
        response.raise_for_status()
        return response

    except HTTPError:
        print(f'An HTTP error occurred. '
              f'Response code {response.status_code}')
        # signal failure to the caller
        return None

After a successful request (status code = 200), the data is sent from the server to the client in JSON format. Note that I passed the URL, an API key, and any parameters I want to include in the request to this function. Now, since we only need a subset of the features returned, we need to further parse the data. This is done in the next step.

PART II

Now we need to parse the returned JSON data, creating our desired feature set. For this part I extract the data into a list of dictionaries containing the four features we are interested in (name, lat, long, and number of campsites), rather than storing all the features returned. This brings up a good point I’ve learned over the years.

When it comes to data analysis, limiting a dataset to include only the features needed is far more efficient than processing large amounts of irrelevant data. Irrelevant data is expensive!

The following function will handle this step.

def parseCampsites(res):
    '''parse an api response from national parks campground data.
        res.json()['data'] will contain all the returned data for
        the endpoint.  We want to extract the camp name, lat, long,
        and the total number of campsites.
    '''
    res_json = res.json()
    camps = list()
    # each item in 'data' describes one campground
    for i in res_json['data']:
        camps.append({'name': i['name'],
                    'lat': i['latitude'],
                    'long': i['longitude'],
                    'total_sites': i['campsites']['totalSites']})
    return camps
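To try the parsing step without an API key, the response can be replaced by a small stand-in object. The sketch below is a more defensive variant of the parser that tolerates records with missing fields. FakeResponse, parse_campsites_safe, and the sample payload are all hypothetical. The payload only mimics the keys used in this post and is not real NPS data.

```python
class FakeResponse:
    """Stand-in for a requests response; only .json() is implemented."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def parse_campsites_safe(res):
    """Extract the four features, using None where a field is missing."""
    camps = []
    for item in res.json().get('data', []):
        sites = item.get('campsites') or {}
        camps.append({'name': item.get('name'),
                      'lat': item.get('latitude'),
                      'long': item.get('longitude'),
                      'total_sites': sites.get('totalSites')})
    return camps


# Made-up payload shaped like the structure the parser expects.
sample = {'data': [
    {'name': 'Example Camp', 'latitude': '29.5', 'longitude': '-103.2',
     'campsites': {'totalSites': '25'}},
    {'name': 'Sparse Camp'},  # no coordinates or campsite counts
]}

camps = parse_campsites_safe(FakeResponse(sample))
print(camps)
```

Using .get() trades strictness for robustness: a record with a missing key yields None instead of raising a KeyError, which may or may not be what you want for your analysis.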

PART III

Now that we have parsed the data to our desired subset of features, we want to convert it to the desired format, namely a Pandas DataFrame. Here camps is the list of dictionaries returned by parseCampsites. The following single line of code does the conversion (the line starting with # is a code comment).

import pandas as pd
# convert raw data to a Pandas DataFrame
df = pd.DataFrame(camps)

See how simple this was! The reason we converted the data to a list of dictionaries is so we could easily import it into a Pandas DataFrame. After running the complete code in a Terminal, the following image shows a partial set of the results for campgrounds in Texas.

Now that we have the data in a desirable format, we can do further processing on it or convert it to another format. For example, since the dataset already has location features provided by our lat/long columns, we could easily convert it to a GeoDataFrame, and from there to an ESRI Shapefile or an OGR data source. This is one of the neat things about learning APIs: we can get the raw data, then convert it to whatever format we desire, all programmatically.
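As one illustration of converting the parsed list to a spatial format, here is a sketch that writes the campgrounds out as GeoJSON using only the standard library. This is a swapped-in alternative to the GeoDataFrame and Shapefile route mentioned above, not the same technique. The function name camps_to_geojson and the sample records are hypothetical, and the sketch assumes lat/long may arrive as strings or be missing.

```python
import json


def camps_to_geojson(camps):
    """Build a GeoJSON FeatureCollection of points from parsed campgrounds."""
    features = []
    for camp in camps:
        try:
            lon, lat = float(camp['long']), float(camp['lat'])
        except (TypeError, ValueError):
            continue  # skip records without usable coordinates
        features.append({
            'type': 'Feature',
            # GeoJSON orders coordinates as [longitude, latitude]
            'geometry': {'type': 'Point', 'coordinates': [lon, lat]},
            'properties': {'name': camp['name'],
                           'total_sites': camp['total_sites']},
        })
    return {'type': 'FeatureCollection', 'features': features}


# Made-up records in the shape produced by the parsing step.
sample_camps = [
    {'name': 'Example Camp', 'lat': '29.5', 'long': '-103.2', 'total_sites': '25'},
    {'name': 'No Coordinates Camp', 'lat': '', 'long': '', 'total_sites': '10'},
]

collection = camps_to_geojson(sample_camps)
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

The resulting text can be saved to a .geojson file and opened in most GIS software.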

In this post we saw that APIs can be a valuable source of raw data for organizations. Tapping into this data goldmine may require a technical person such as an Analyst or a Developer, but the process is simple and straightforward. Organizations can use API data for analysis or to build interactive components into new or existing applications. Readers who are interested can learn more about the NPS API at the following website – NPS API Documentation.

Categories: Data Analytics Pandas DataFrame Python

JwL

2 replies

  1. Great post, I feel ashamed that I never used an API to get data to a Panda data frame when it is so easy.

    I agree that there are few public forest resources that have API:s. Here in Sweden the Forest Agency have a lot that could be used, like forest parameters based on the national Aerial Lidar survey.

    It was hard to find the api-key from nps and I think it would be good to include the URL in the first code section.

    Keep on the good work.


    1. Johan, thanks for the comments and for pointing out the need to include the api-key url. I have updated the post to include a link of how to sign up with the NPS to receive a personal API key.

