In this post, I will cover getting open source COVID-19 data for the United States using Julia. The data pipeline demonstrated here is a very simple example and could easily be adapted into a Prefect, Apache NiFi, or Apache Airflow ETL process.
Data Search
A quick search on DuckDuckGo turned up The COVID Tracking Project, among other sources I will explore later.
Other sources:
- https://ourworldindata.org/coronavirus-source-data
- https://datascience.nih.gov/covid-19-open-access-resources
- https://github.com/nytimes/covid-19-data
- https://datausa.io/coronavirus
I would also like population information for each state so I can assess the COVID data for each state relative to its population. Another search led me to the World Population Review website, which provides population figures across multiple international and domestic dimensions in downloadable form (CSV and JSON).
Data API and Download
Julia Packages Used
For this post, I will use the following Julia packages in my script:
using CSVFiles # version 0.16.1
using DataFrames # version 0.21.8
using Dates
using Chain # version 0.4.2
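If any of these are missing from your environment, they can be added through Julia's package manager (Dates ships with the standard library, so it needs no installation):

# Install the third-party packages used in this post.
using Pkg
Pkg.add(["CSVFiles", "DataFrames", "Chain"])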
COVID API
After reading through the website's API documentation, I constructed the following in Julia to pull the data. Since I wanted all daily state data, I opted for the all-states historical daily values. The API can also return data for the US as a whole or for a specific state, in either CSV or JSON format. The page also documents each field and whether a field has been deprecated.
base_url = "https://api.covidtracking.com"
all_states = "/v1/states/daily.csv"
state_specific_api = "/v1/states/{state}/daily.csv"
If I only wanted the current day's data for all states, or for a single state, I could use the following endpoints instead:
current_states = "/v1/states/current.csv"
current_state = "/v1/states/{state}/current.csv"
Finally, I download the data. Modify the URL argument to suit your needs, along with the destination filename and location.
base_dir = "/home/linux/ProblemXSolutions.com/DataProjects/covid19"
filename = "/data/all_states_daily_covid_jl.csv"
download(string(base_url, all_states), string(base_dir, filename));
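As a side note, here is a quick sketch of how the state-specific template could be used; the helper name and the lowercase-abbreviation convention are my assumptions, based on the {state} placeholder in the URL:

# Hypothetical helper: fill the {state} placeholder and download a single
# state's daily history (assuming the API expects the lowercase two-letter
# state abbreviation in the path).
function download_state_daily(state_abbr, dest_dir)
    url = string(base_url, replace(state_specific_api, "{state}" => lowercase(state_abbr)))
    download(url, joinpath(dest_dir, string(state_abbr, "_daily_covid_jl.csv")))
end

# Example: download_state_daily("wa", string(base_dir, "/data"))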
Population Data
Next, I get the population data by state. The site did not list an API, so I downloaded the file to the same location as the COVID data. When I attempted to fetch the CSV with Julia's download function I encountered an error, so I simply did it manually.
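I did not chase down the root cause, but if the failure was the server rejecting Julia's default client, a sketch along these lines, using the HTTP.jl package (not otherwise used in this post) with a browser-style User-Agent header, might serve as an alternative to the manual download. The URL below is a placeholder, not the real endpoint:

using HTTP  # assumes HTTP.jl is installed

# Hypothetical workaround: fetch the CSV with an explicit User-Agent header,
# then write the response body to disk. Replace the placeholder URL with the
# actual CSV link from World Population Review.
resp = HTTP.get("https://example.com/state_populations.csv",
                ["User-Agent" => "Mozilla/5.0"])
write(string(base_dir, "/data/state_populations.csv"), resp.body)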
State Look Up Table
After looking at each dataset, I decided I needed a simple lookup table to translate state names to their abbreviated form. Since I already had one on hand, it was easy to re-use.
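The lookup file itself isn't shown in this post, but based on the column names used in the join code later (state and state_abr), its assumed shape is something like:

# Assumed contents of state_lu.csv (illustrative rows only):
#
#   state,state_abr
#   Washington,WA
#   Oregon,OR
#   California,CA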
Reading in the Data
Now that I have all my data, it's time to read in each file.
state_data = load(string(base_dir, filename)) |> DataFrame
state_pop_data = load(string(base_dir,"/data/state_populations.csv")) |> DataFrame
state_name_lookup_data = load(string(base_dir, "/data/state_lu.csv")) |> DataFrame
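A quick, optional sanity check confirms each file parsed as expected:

# Peek at the first rows and summarize columns of the freshly loaded tables.
first(state_data, 5)
describe(state_pop_data)
names(state_name_lookup_data)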
Initial Processing
State Population Data Processing
After exploring the state population data, I decided to reduce the dataset to what I need using the following code:
state_pop_reduced =
@chain state_pop_data begin
select([:State, :Pop, :density])
rename(Dict(:State => :state))
transform(:state => (x -> strip.(x)) => :state)
end
state_pop_data = nothing  # release the original frame; nothing (not the type Nothing) clears the binding
state_name_lookup_data.state = strip.(state_name_lookup_data.state)
state_pop_reduced =
@chain state_pop_reduced begin
leftjoin(
state_name_lookup_data,
on =:state)
select(Not([:state]))
rename(Dict(:state_abr => :state))
end
If you have read the R or Python versions of this post, you will notice Julia reads more like R in cleanliness, and the chaining/piping helps with that.
State COVID Data Processing
After exploring the COVID-19 data, I decided to remove some of the original fields and calculate some new ones. First, I perform a join with my population data by state.
Since Julia is strict when applying a function to a field containing missing values, I remove those data points from this particular dataframe. I can safely do this because I know the values being removed belong to territories and districts I am not processing. There are ways to work with missing values, but I don't need to implement them for this effort.
I also transformed the date field, which is in a "yyyymmdd" format that was read in as a numeric datatype. A simple fix is to parse the field using Dates package functions to convert it to a date type. I also sorted the data at this step (a difference from the Python version).
dropmissing!(state_pop_reduced)
state_pop_reduced.state = strip.(state_pop_reduced.state)
state_data_enhanced =
@chain state_data begin
transform(:state => (x -> strip.(x)) => :state,
:date => (x -> Date.(string.(x), Dates.DateFormat("yyyymmdd"))) => :date)
leftjoin(state_pop_reduced,
on =:state)
sort([:state,:date])
end
state_data = nothing  # release the original frame
Since Julia doesn't ship with some of the convenience functions readily available in Python's pandas package or R's base, zoo, and dplyr libraries, I created a few functions to perform the same tasks.
# Difference between each element and the element k positions earlier;
# returns 0 for positions without a k-step predecessor.
function deltas(v, k)
    [ 1 <= i-k <= length(v) ? v[i]-v[i-k] : 0 for i in 1:length(v) ]
end
# Adjusting for population to help normalize numbers between states:
# (daily_{cases,deaths} / population of state) * 100,000
# Returns a single adjusted value per (count, population) pair, suitable
# for use with ByRow below.
function pop_adjusted(x, y)
    (x / y) * 100000
end
# Trailing k-period moving average; returns 0 until a full window is available.
function moving_average(v, k)
    [ 1 <= i-k <= length(v) ? sum(@view v[(i-k+1):i])/k : 0 for i in 1:length(v) ]
end
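To make the behavior of these helpers concrete, here is what they return on a toy vector (values worked out by hand):

v = [1, 3, 6, 10, 15]
deltas(v, 1)                  # [0, 2, 3, 4, 5] -- change vs. previous element; 0 where none exists
moving_average(v, 2)          # [0, 0, 4.5, 8.0, 12.5] -- trailing mean once a full window is available
pop_adjusted(250, 1_000_000)  # 25.0 -- 250 events in a population of 1,000,000 is 25 per 100,000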
Data Enhancement
Next, I group the data by state before performing a series of transformations to create new calculated fields. Using the transform function, I create the daily_recover field by running my user-defined deltas function on the original recovered field. This brings it in line with the dataset's existing fields that report daily changes rather than accumulated recovery totals.
Next, I perform another set of transformations to calculate adjusted numbers (cases, recoveries, and deaths) based on each state's population. The idea is to normalize between states, since population sizes vary quite a bit. The same data can tell two different stories depending on what is being conveyed and how it is interpreted: what matters more, actual volume or per capita volume? To calculate per capita numbers I used the following formula:
(daily_{cases,recover,deaths} / population of state) * 100,000
Then I roll into another transform call to calculate the rolling averages of daily cases, recoveries, and deaths. This helps smooth out the day-to-day noise in the numbers and is what you typically see in charts in the media and other graphics on the internet. The reason for breaking up the transform calls is two-fold. First, it groups like transformation operations together, in this case daily numbers versus rolling averages. Second, if I need to compute on a field created within a transform call, that computation has to happen in a separate call; otherwise the transformation will not see the new field within the same transform call.
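A tiny toy example of that constraint (the frame and column names here are hypothetical, not from the COVID data):

df = DataFrame(a = [1, 2, 3])
# This errors: :b does not exist in df when the second pair is evaluated,
# because all source columns are resolved against the input data frame.
# transform(df, :a => (x -> 2 .* x) => :b,
#               :b => (x -> x .+ 1) => :c)
# Splitting into two transform calls works:
df = transform(df, :a => (x -> 2 .* x) => :b)
df = transform(df, :b => (x -> x .+ 1) => :c)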
In the final round of transformations, I apply another transform to the previous calculations to get the rolling 7-day averages based on both actual volume and per capita numbers.
state_data_enhanced =
@chain state_data_enhanced begin
groupby(:state)
transform(:recovered => (x -> deltas(x, 1)) => :daily_recover)
transform(
[:positiveIncrease, :Pop] => ByRow(pop_adjusted) => :daily_cases_adj,
[:daily_recover, :Pop] => ByRow(pop_adjusted) => :daily_recover_adj,
[:deathIncrease, :Pop] => ByRow(pop_adjusted) => :daily_deaths_adj)
transform(
:positiveIncrease => (x -> moving_average(x, 7)) => :active_roll7,
:daily_recover => (x -> moving_average(x, 7)) => :recovered_roll7,
:deathIncrease => (x -> moving_average(x, 7)) => :deaths_roll7,
:daily_cases_adj => (x -> moving_average(x, 7)) => :active_roll7_adj,
:daily_recover_adj => (x -> moving_average(x, 7)) => :recovered_roll7_adj,
:daily_deaths_adj => (x -> moving_average(x, 7)) => :deaths_roll7_adj)
end
Differences in Python and R versions
In Python, I had to break the process into two separate sets of operations. In the original R version, I could use piping to implement the process in one set of operations. Using Julia's Chain package, I can use the same single set of operations as in R.
Originally I used a broken-up version in Julia to help me troubleshoot and debug, but once it behaved the same way as the R code, I reverted to the R-style version. Grouping like operations together also makes the code easier to read and debug, and you could easily adopt either style depending on your preferences.
Data Reduction
With all the fields in the data, I decided to reduce them to the ones I'm interested in. I always have access to the original data, so I'm not concerned if I later need to revisit this list and reprocess.
state_enhanced_reduced =
@chain state_data_enhanced begin
select(
:date, :state,
:positive, :recovered, :death,
:positiveIncrease, :daily_cases_adj,
:active_roll7, :active_roll7_adj,
:daily_recover, :daily_recover_adj,
:recovered_roll7, :recovered_roll7_adj,
:deathIncrease, :daily_deaths_adj, :deaths_roll7,
:deaths_roll7_adj,
:hospitalizedIncrease, :hospitalizedCurrently,
:inIcuCurrently
)
end
Write Out the Processed Data
Now that I have my initially processed data, I will write out the results for ease of use in follow-on work. Notice I have appended "_jl" to the filenames to differentiate these outputs from those produced in the other languages. This makes it easier to verify and validate the outputs against each other.
save(string(base_dir, "/data/state_data_enhanced_jl.csv"), state_data_enhanced)
save(string(base_dir, "/data/state_enhanced_reduced_jl.csv"), state_enhanced_reduced)
GitHub Links
- https://github.com/problemxsolutions/covid19/blob/master/scripts/r/covid_data_processing.R
- https://github.com/problemxsolutions/covid19/blob/master/scripts/python/covid_data_processing.py
- https://github.com/problemxsolutions/covid19/blob/master/scripts/julia/covid_data_processing.jl