In this post, I will cover getting open source COVID-19 data for the United States using R. The data pipeline demonstrated here is a very simple example and could easily be adapted into a Prefect, Apache NiFi, or Apache Airflow ETL process.
Data Search
A quick search on DuckDuckGo turned up The COVID Tracking Project, among other sources I will explore later.
Other sources:
- https://ourworldindata.org/coronavirus-source-data
- https://datascience.nih.gov/covid-19-open-access-resources
- https://github.com/nytimes/covid-19-data
- https://datausa.io/coronavirus
I also want population information for each state so I can assess the COVID data for each state with respect to its population. Another search led me to the World Population Review website, which provides population data across multiple international and domestic dimensions in downloadable formats (CSV and JSON).
Data API and Download
R Libraries Used
For this post, I will use the following R libraries in my script:
library(tidyverse) # dplyr, readr, ggplot2, etc.
library(magrittr)  # extra pipe operators
library(lubridate) # date parsing (ymd)
library(zoo)       # rolling averages (rollmean)
COVID API
After reading through the website's documentation about the API, I constructed the following in R to pull the data. Since I wanted daily data for every state, I opted for the all-states historical daily values endpoint. The API can also return the US as a whole and state-specific queries, in either CSV or JSON format. The page also documents each field and whether a field has been deprecated.
base_url <- 'https://api.covidtracking.com'
all_states <- '/v1/states/daily.csv'
state_specific_api <- '/v1/states/{state}/daily.csv'
If I only wanted the current day's data, for all states or for a specific state, I could use the following endpoints as well:
current_states <- '/v1/states/current.csv'
current_state <- '/v1/states/{state}/current.csv'
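The {state} placeholder in the state-specific endpoints is glue-style templating. As a minimal sketch (assuming the glue package, which is not among the libraries loaded above), filling it in might look like this:

library(glue) # assumption: not loaded earlier in this post
# Substitute a lowercase state abbreviation into the endpoint template
ny_url <- glue(paste0(base_url, state_specific_api), state = 'ny')
# "https://api.covidtracking.com/v1/states/ny/daily.csv"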
Finally, I download the data. Modify the URL argument to tailor it to your needs, as well as the destination filename and location.
filename <- './data/all_states_daily_covid.csv'
download.file(url = paste0(base_url, all_states),
              destfile = filename)
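One caveat, an assumption on my part rather than anything from the API docs: download.file will error out if the ./data directory does not exist yet, so a small guard does not hurt:

# Create the destination directory if it is missing
if (!dir.exists('./data')) dir.create('./data')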
Population Data
Next, I'm getting the population data by state. The site did not list an API, so I just downloaded the file to the same location as the COVID data. When attempting to get the CSV with R's download.file function, I encountered an error, so I did it manually through the browser instead.
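If you would rather keep this step scripted, one possible workaround (assuming the error comes from the site rejecting R's default user agent, which I have not verified) is to pass a browser-like user agent via download.file's headers argument. The URL here is a placeholder; substitute the actual CSV link from the site:

# Hypothetical sketch; pop_url is a placeholder, not the real CSV link
pop_url <- 'https://worldpopulationreview.com/states'
download.file(url = pop_url,
              destfile = './data/state_populations.csv',
              headers = c(`User-Agent` = 'Mozilla/5.0'))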
State Lookup Table
After looking at each dataset, I decided I needed a simple lookup table to translate state names to their abbreviated form. Since I already had one on hand, it was easy to just re-use.
If you don't have one, you can also use the state datasets in the datasets package that ships with base R.
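As a minimal sketch, the table can be built from base R's state.name and state.abb vectors; the column names state and state_abr are assumptions chosen to match the join and rename steps later in this post:

# Note: covers the 50 states only, not DC or the territories in the COVID data
state_lu <- tibble(state = state.name,    # full state name
                   state_abr = state.abb) # two-letter abbreviation
write_csv(state_lu, './data/state_lu.csv')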
Reading in the Data
Now that I have all my data, it's time to read in each file.
state_data <- read_csv(file = filename)
state_pop_data <- read_csv(file = './data/state_populations.csv')
state_name_lookup_data <- read_csv(file = './data/state_lu.csv')
Initial Processing
After exploring the state population data, I decided to reduce the dataset to what I need using the following code:
state_pop_reduced <-
  state_pop_data %>%
  select(State, Pop, density) %>%          # keep only the fields I need
  left_join(x = .,
            y = state_name_lookup_data,
            by = c('State' = 'state')) %>% # attach the state abbreviations
  select(-1) %>%                           # drop the full state name
  rename(state = state_abr)                # match the COVID data's 'state' field
After exploring the COVID-19 data, I decided to remove some of the original fields and calculate some new ones. First, I perform a join with my population data by state.
Then I transform the date field, which is in a "yyyymmdd" format that was read in as a numeric datatype. A simple fix is to parse the field with lubridate to convert it to a proper date.
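For example, lubridate's ymd happily parses those numeric values directly:

ymd(20200415)
# [1] "2020-04-15"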
Next, I handle some basic grouping and sorting operations before creating new fields. Using the mutate function, I create daily_recover to match the dataset's existing daily-change fields (positiveIncrease, deathIncrease), since the source only provides accumulated recovery numbers.
Then I roll into another mutate call to calculate rolling averages of daily cases, recoveries, and deaths. This helps smooth out day-to-day fluctuations in the numbers and is what you typically see in charts in the media and other graphics on the internet. The reason for breaking up the mutate calls is two-fold: it groups like transformations together (daily numbers vs. rolling averages), and it makes explicit that later calculations depend on fields created in an earlier step.
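To illustrate what zoo's rollmean does with these arguments, here is a toy vector rather than the real data. Note that rollmean centers the window by default, filling the edges with NA; pass align = 'right' if you want a trailing average instead:

rollmean(c(1, 2, 3, 4, 5, 6, 7, 8), k = 7, fill = NA)
# [1] NA NA NA  4  5 NA NA NA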
Next, I perform another set of transformations to calculate adjusted numbers (cases, recoveries, and deaths) based on each state's population. The idea is to normalize between states, since population sizes vary quite a bit. Raw counts and per capita counts can tell two different stories, and which one matters depends on what is being conveyed and interpreted from the results: actual volume or per capita volume? To calculate per capita numbers I used the following formula:
(daily_{cases,recover,deaths} / population of state) * 100,000
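For example, a state of 5,000,000 people reporting 500 new cases in a day works out to (500 / 5,000,000) * 100,000 = 10 daily cases per 100,000 residents.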
Finally, I perform another mutate on the previous calculations to get the rolling 7-day averages of the per capita numbers.
state_data_enhanced <-
  state_data %>%
  left_join(x = .,
            y = state_pop_reduced,
            by = c('state')) %>%
  mutate(date = ymd(date)) %>% # parse numeric yyyymmdd into a Date
  group_by(state) %>%
  arrange(state, date) %>%
  # derive daily recoveries from the accumulated 'recovered' field
  mutate(
    daily_recover = recovered - lag(recovered, default = first(recovered))
  ) %>%
  # per capita (per 100k) daily numbers and rolling 7-day averages of raw counts
  mutate(daily_cases_adj = (positiveIncrease / Pop) * 100000,
         daily_recover_adj = (daily_recover / Pop) * 100000,
         daily_deaths_adj = (deathIncrease / Pop) * 100000,
         active_roll7 = rollmean(positiveIncrease, k = 7, fill = NA),
         recovered_roll7 = rollmean(daily_recover, k = 7, fill = NA),
         deaths_roll7 = rollmean(deathIncrease, k = 7, fill = NA)) %>%
  # rolling 7-day averages of the per capita numbers
  mutate(active_roll7_adj = rollmean(daily_cases_adj, k = 7, fill = NA),
         recovered_roll7_adj = rollmean(daily_recover_adj, k = 7, fill = NA),
         deaths_roll7_adj = rollmean(daily_deaths_adj, k = 7, fill = NA))
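As a quick sanity check, you can eyeball one state's raw and adjusted rolling averages (assuming the state column holds uppercase two-letter abbreviations such as 'NY'):

state_data_enhanced %>%
  filter(state == 'NY') %>%
  select(date, positiveIncrease, active_roll7, active_roll7_adj) %>%
  tail()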
Data Reduction
With all the fields in the data, I decided to reduce them to the ones I'm interested in. I always have access to the original data, so I'm not concerned if I need to revisit this list and reprocess later.
state_enhanced_reduced <-
state_data_enhanced %>%
select(date, state,
positive, recovered, death,
# daily_cases,
# daily_deaths,
positiveIncrease,
daily_cases_adj, active_roll7, active_roll7_adj,
daily_recover, daily_recover_adj, recovered_roll7,
recovered_roll7_adj,
deathIncrease, daily_deaths_adj, deaths_roll7,
deaths_roll7_adj,
hospitalizedIncrease, hospitalizedCurrently,
inIcuCurrently
)
Write Out the Processed Data
Now that I have my initially processed data, I will write out the results for ease of use during follow-on work.
write_csv(file = './data/state_data_enhanced.csv',
x = state_data_enhanced)
write_csv(file = './data/state_enhanced_reduced.csv',
x = state_enhanced_reduced)