Introduction:
As a data analyst, gathering accurate and relevant data is crucial for insightful analysis and decision-making. One of the primary methods to obtain such data is by fetching it from APIs provided by various sources. This process involves connecting to different APIs, retrieving the necessary data, and then cleaning and preparing it for comprehensive analysis. By efficiently managing this data gathering process, we can ensure the integrity and reliability of our analysis, leading to more precise and actionable insights.
Tutorial:
First, open your preferred IDE to start the Data Analysis Process (DAP). I recommend using Jupyter Notebook or Google Colab because they come with the common data libraries preinstalled and let you run code cell by cell.
Step 1: Import the two libraries we need, requests and pandas.
import requests
import pandas as pd
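If either library isn't available in your environment (Jupyter Notebook and Google Colab usually ship with both), you can install them with pip:
pip install requests pandas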
Step 2: Request data from the API URL.
Step 3: Convert the fetched data into a DataFrame using the pd.DataFrame() constructor.
# Syntax:
var_name = requests.get('url')                  # send a GET request to the API endpoint
dataframe_name = pd.DataFrame(var_name.json())  # parse the JSON response into a DataFrame
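Before converting the response, it's good practice to confirm that the request actually succeeded. A minimal check using the requests library's built-in error handling:
response = requests.get('url')
response.raise_for_status()   # raises an HTTPError if the server returned an error status (e.g. 404 or 500)
data = response.json()        # parse the JSON body only after a successful response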
Here, I'm going to fetch a population dataset for the USA through an API.
Many APIs return data in JSON format, which is a collection of key-value pairs. For my analysis, I only need the 'data' key and its values.
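If you're not sure which keys an API returns, you can inspect the parsed JSON first. A quick sketch (the exact keys vary by API; for this endpoint the payload typically contains 'data' and 'source'):
response = requests.get('https://datausa.io/api/data?drilldowns=Nation&measures=Population')
print(response.json().keys())   # lists the top-level keys of the JSON payload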
Step 4: Print the DataFrame to get a visual sense of the data.
response = requests.get('https://datausa.io/api/data?drilldowns=Nation&measures=Population')
df = pd.DataFrame(response.json()['data'])
print(df)
When printed, the dataset will look like this. Note that this is only page 1 of the full dataset available from the API.
To fetch all the pages of data, we need to loop through the pages of the API URL.
Step 5: Create an empty DataFrame with pandas; each page's data will be collected into it using the pd.concat() function.
temp_df = pd.DataFrame()   # empty DataFrame that will accumulate every page
Step 6: Check the API metadata to find out the number of pages.
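How the page count is exposed differs from API to API: some report it in the JSON payload, others in response headers or in their documentation. A hedged sketch, where 'total' and 'limit' are hypothetical metadata fields standing in for whatever the API actually provides:
import math

meta = response.json()
print(meta.keys())   # look for pagination fields alongside 'data'
if 'total' in meta and 'limit' in meta:   # hypothetical field names
    pages = math.ceil(meta['total'] / meta['limit'])
    print(pages)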
Step 7: Loop through each page of the API, request the data for that page, and append it to the empty DataFrame. (The exact pagination parameter varies by API; here I assume a page query parameter.)
for i in range(1, 11):
    # append the page number to the URL; the query-parameter name ('page') is
    # an assumption and depends on the API's pagination scheme
    response = requests.get(f'https://datausa.io/api/data?drilldowns=Nation&measures=Population&page={i}')
    df = pd.DataFrame(response.json()['data'])
    temp_df = pd.concat([temp_df, df], ignore_index=True)
Step 8: I used the ignore_index=True parameter here because each page's DataFrame carries its own index starting from 0. Without it, appending the pages would repeat those index labels and disrupt the consistency of the fetched data.
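A minimal illustration of the difference, using two toy DataFrames:
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})
print(pd.concat([a, b]).index.tolist())                     # [0, 1, 0, 1] - duplicated index labels
print(pd.concat([a, b], ignore_index=True).index.tolist())  # [0, 1, 2, 3] - one continuous index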
Step 9: Print the fetched DataFrame.
print(temp_df)
The DataFrame will look like this:
The fetched data can span tens of thousands of rows, but in my case, it is only 200 rows.
The fetching time will depend on the number of pages.
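If you're fetching many pages, it's also considerate (and sometimes required by an API's rate limits) to pause briefly between requests. A minimal sketch using time.sleep, to be added inside the loop from Step 7:
import time

for i in range(1, 11):
    # ... request the page and concatenate it, as in Step 7 ...
    time.sleep(0.5)   # wait half a second between requests to stay well under typical rate limits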
Step 10: Export this DataFrame to your desired file format. I am exporting it to the standard CSV format.
# Syntax:
dataframe_name.to_csv('filename.csv')
temp_df.to_csv('slug-nation.csv', index=False)   # index=False keeps the row index out of the CSV
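To confirm the export worked, you can read the file straight back into pandas and compare its shape with temp_df:
check = pd.read_csv('slug-nation.csv')
print(check.shape)   # should match temp_df.shape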
Finally, you're ready to move on to the next stage of the Data Analysis Process (DAP): cleaning and analyzing the data.
Conclusion:
In conclusion, fetching data through APIs is a fundamental skill for data analysts, enabling access to vast and diverse datasets. By following the steps outlined in this tutorial, you can effectively gather, clean, and prepare data for analysis, ensuring its integrity and reliability. As you continue to refine your data analysis process, mastering API data retrieval will enhance your ability to generate precise and actionable insights, ultimately contributing to more informed decision-making.
Thank you...