In this post, I will cover getting open source COVID-19 data for the United States using Julia. The data pipeline demonstrated here is a very simple example and could easily be adapted into a Prefect, Apache NiFi, or Apache Airflow ETL process.
Data Search
A quick search on DuckDuckGo turned up The COVID Tracking Project, among other sources I will explore later.
Other sources:
- https://ourworldindata.org/coronavirus-source-data
- https://datascience.nih.gov/covid-19-open-access-resources
- https://github.com/nytimes/covid-19-data
- https://datausa.io/coronavirus
I would also like population information for each state so I can assess the COVID data for each state relative to its population. Another search led me to the World Population Review website, which provides population figures across multiple international and domestic dimensions in downloadable form (CSV and JSON).
Data API and Download
Julia Packages Used
For this post, I will use the following Julia packages in my script:
using CSVFiles # version 0.16.1
using DataFrames # version 0.21.8
using Dates
using Chain # version 0.4.2
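If any of these are missing from your environment, they can be added through Julia's package manager (Dates ships with the standard library, so it needs no installation):

# Install the third-party packages used in this post.
using Pkg
Pkg.add(["CSVFiles", "DataFrames", "Chain"])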
COVID API
After reading through the website's API documentation, I constructed the following in Julia to pull the data. Since I wanted all daily state data, I opted for the all-states historical daily values. The API can also return data for the US as a whole or for a specific state, in either CSV or JSON format. The page also documents each field and whether a field has been deprecated.
base_url = "https://api.covidtracking.com"
all_states = "/v1/states/daily.csv"
state_specific_api = "/v1/states/{state}/daily.csv"
If I only wanted the current day's data for all states, or for a single state, I could use the following endpoints instead:
current_states = "/v1/states/current.csv"
current_state = "/v1/states/{state}/current.csv"
Finally, I download the data. Modify the URL argument to suit your needs, along with the destination filename and location.
base_dir = "/home/linux/ProblemXSolutions.com/DataProjects/covid19"
filename = "/data/all_states_daily_covid_jl.csv"
download(string(base_url, all_states), string(base_dir, filename));
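As a side note, here is a quick sketch of how the state-specific template could be used; the helper name and the lowercase-abbreviation convention are my assumptions, based on the {state} placeholder in the URL:

# Hypothetical helper: fill the {state} placeholder and download a single
# state's daily history (assuming the API expects the lowercase two-letter
# state abbreviation in the path).
function download_state_daily(state_abbr, dest_dir)
    url = string(base_url, replace(state_specific_api, "{state}" => lowercase(state_abbr)))
    download(url, joinpath(dest_dir, string(state_abbr, "_daily_covid_jl.csv")))
end

# Example: download_state_daily("wa", string(base_dir, "/data"))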
Population Data
Next, I get the population data by state. The site did not list an API, so I downloaded the file to the same location as the COVID data. When I attempted to fetch the CSV with Julia's download function I encountered an error, so I simply did it manually.
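I did not chase down the root cause, but if the failure was the server rejecting Julia's default client, a sketch along these lines, using the HTTP.jl package (not otherwise used in this post) with a browser-style User-Agent header, might serve as an alternative to the manual download. The URL below is a placeholder, not the real endpoint:

using HTTP  # assumes HTTP.jl is installed

# Hypothetical workaround: fetch the CSV with an explicit User-Agent header,
# then write the response body to disk. Replace the placeholder URL with the
# actual CSV link from World Population Review.
resp = HTTP.get("https://example.com/state_populations.csv",
                ["User-Agent" => "Mozilla/5.0"])
write(string(base_dir, "/data/state_populations.csv"), resp.body)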
State Look Up Table
After looking at each dataset, I decided I needed a simple lookup table to translate state names to their abbreviated form. Since I already had one on hand, it was easy to re-use.
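The lookup file itself isn't shown in this post, but based on the column names used in the join code later (state and state_abr), its assumed shape is something like:

# Assumed contents of state_lu.csv (illustrative rows only):
#
#   state,state_abr
#   Washington,WA
#   Oregon,OR
#   California,CA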
Reading in the Data
Now that I have all my data, it's time to read in each file.
state_data = load(string(base_dir, filename)) |> DataFrame
state_pop_data = load(string(base_dir,"/data/state_populations.csv")) |> DataFrame
state_name_lookup_data = load(string(base_dir, "/data/state_lu.csv")) |> DataFrame
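A quick, optional sanity check confirms each file parsed as expected:

# Peek at the first rows and summarize columns of the freshly loaded tables.
first(state_data, 5)
describe(state_pop_data)
names(state_name_lookup_data)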
Initial Processing
State Population Data Processing
After exploring the state population data, I decided to reduce the dataset to what I need using the following code:
state_pop_reduced =
@chain state_pop_data begin
select([:State, :Pop, :density])
rename(Dict(:State => :state))
transform(:state => (x -> strip.(x)) => :state)
end
state_pop_data = nothing  # release the original frame; nothing (not the type Nothing) clears the binding
state_name_lookup_data.state = strip.(state_name_lookup_data.state)
state_pop_reduced =
@chain state_pop_reduced begin
leftjoin(
state_name_lookup_data,
on =:state)
select(Not([:state]))
rename(Dict(:state_abr => :state))
end
If you have read the R or Python versions of this post, you will notice Julia reads more like R in cleanliness, and the chaining/piping helps with that.
State COVID Data Processing
After exploring the COVID-19 data, I decided to remove some of the original fields and calculate some new ones. First, I perform a join with my population data by state.
Since Julia is strict when applying a function to a field containing missing values, I remove those data points from this particular dataframe. I can safely do this because I know the values being removed belong to territories and districts I am not processing. There are ways to work with missing values, but I don't need to implement them for this effort.
I also transformed the date field, which is in a "yyyymmdd" format that was read in as a numeric datatype. A simple fix is to parse the field using Dates package functions to convert it to a date type. I also sorted the data at this step (a difference from the Python version).
dropmissing!(state_pop_reduced)
state_pop_reduced.state = strip.(state_pop_reduced.state)
state_data_enhanced =
@chain state_data begin
transform(:state => (x -> strip.(x)) => :state,
:date => (x -> Date.(string.(x), Dates.DateFormat("yyyymmdd"))) => :date)
leftjoin(state_pop_reduced,
on =:state)
sort([:state,:date])
end
state_data = nothing  # release the original frame
Since Julia doesn't ship with some of the convenience functions readily available in Python's pandas package or R's base, zoo, and dplyr libraries, I created a few functions to perform the same tasks.
# Difference between each element and the element k positions earlier;
# returns 0 for positions without a k-step predecessor.
function deltas(v, k)
    [ 1 <= i-k <= length(v) ? v[i]-v[i-k] : 0 for i in 1:length(v) ]
end
# Adjusting for population to help normalize numbers between states:
# (daily_{cases,deaths} / population of state) * 100,000
# Returns a single adjusted value per (count, population) pair, suitable
# for use with ByRow below.
function pop_adjusted(x, y)
    (x / y) * 100000
end
# Trailing k-period moving average; returns 0 until a full window is available.
function moving_average(v, k)
    [ 1 <= i-k <= length(v) ? sum(@view v[(i-k+1):i])/k : 0 for i in 1:length(v) ]
end
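To make the behavior of these helpers concrete, here is what they return on a toy vector (values worked out by hand):

v = [1, 3, 6, 10, 15]
deltas(v, 1)                  # [0, 2, 3, 4, 5] -- change vs. previous element; 0 where none exists
moving_average(v, 2)          # [0, 0, 4.5, 8.0, 12.5] -- trailing mean once a full window is available
pop_adjusted(250, 1_000_000)  # 25.0 -- 250 events in a population of 1,000,000 is 25 per 100,000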
Data Enhancement
Next, I group the data by state before performing a series of transformations to create new calculated fields. Using the transform function, I create the daily_recover field by running my user-defined deltas function on the original recovered field. This brings it in line with the dataset's existing fields that report daily changes rather than accumulated recovery totals.
Next, I perform another set of transformations to calculate adjusted numbers (cases, recoveries, and deaths) based on each state's population. The idea is to normalize between states, since population sizes vary quite a bit. The same data can tell two different stories depending on what is being conveyed and how it is interpreted: what matters more, actual volume or per capita volume? To calculate per capita numbers I used the following formula:
(daily_{cases,recover,deaths} / population of state) * 100,000
Then I roll into another transform call to calculate the rolling averages of daily cases, recoveries, and deaths. This helps smooth out the day-to-day noise in the numbers and is what you typically see in charts in the media and other graphics on the internet. The reason for breaking up the transform calls is two-fold. First, it groups like transformation operations together, in this case daily numbers versus rolling averages. Second, if I need to compute on a field created within a transform call, that computation has to happen in a separate call; otherwise the transformation will not see the new field within the same transform call.
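A tiny toy example of that constraint (the frame and column names here are hypothetical, not from the COVID data):

df = DataFrame(a = [1, 2, 3])
# This errors: :b does not exist in df when the second pair is evaluated,
# because all source columns are resolved against the input data frame.
# transform(df, :a => (x -> 2 .* x) => :b,
#               :b => (x -> x .+ 1) => :c)
# Splitting into two transform calls works:
df = transform(df, :a => (x -> 2 .* x) => :b)
df = transform(df, :b => (x -> x .+ 1) => :c)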
In the final round of transformations, I apply another transform to the previous calculations to get the rolling 7-day averages based on both actual volume and per capita numbers.
state_data_enhanced =
@chain state_data_enhanced begin
groupby(:state)
transform(:recovered => (x -> deltas(x, 1)) => :daily_recover)
transform(
[:positiveIncrease, :Pop] => ByRow(pop_adjusted) => :daily_cases_adj,
[:daily_recover, :Pop] => ByRow(pop_adjusted) => :daily_recover_adj,
[:deathIncrease, :Pop] => ByRow(pop_adjusted) => :daily_deaths_adj)
transform(
:positiveIncrease => (x -> moving_average(x, 7)) => :active_roll7,
:daily_recover => (x -> moving_average(x, 7)) => :recovered_roll7,
:deathIncrease => (x -> moving_average(x, 7)) => :deaths_roll7,
:daily_cases_adj => (x -> moving_average(x, 7)) => :active_roll7_adj,
:daily_recover_adj => (x -> moving_average(x, 7)) => :recovered_roll7_adj,
:daily_deaths_adj => (x -> moving_average(x, 7)) => :deaths_roll7_adj)
end
Differences in Python and R versions
In Python, I had to break the process into two separate sets of operations. In the original R version, I could use piping to implement the process in one set of operations. Using Julia's Chain package, I can use the same single set of operations as in R.
Originally I used a broken-up version in Julia to help me troubleshoot and debug, but once it behaved the same way as the R code, I reverted to the R-style version. Grouping like operations together also makes the code easier to read and debug, and you could easily adopt either style depending on your preferences.
Data Reduction
With all the fields in the data, I decided to reduce them to the ones I'm interested in. I always have access to the original data, so I'm not concerned if I later need to revisit this list and reprocess.
state_enhanced_reduced =
@chain state_data_enhanced begin
select(
:date, :state,
:positive, :recovered, :death,
:positiveIncrease, :daily_cases_adj,
:active_roll7, :active_roll7_adj,
:daily_recover, :daily_recover_adj,
:recovered_roll7, :recovered_roll7_adj,
:deathIncrease, :daily_deaths_adj, :deaths_roll7,
:deaths_roll7_adj,
:hospitalizedIncrease, :hospitalizedCurrently,
:inIcuCurrently
)
end
Write Out the Processed Data
Now that I have my initially processed data, I will write out the results for ease of use in follow-on work. Notice I have appended "_jl" to the filenames to differentiate these outputs from those produced in the other languages. This makes it easier to verify and validate the outputs against each other.
save(string(base_dir, "/data/state_data_enhanced_jl.csv"), state_data_enhanced)
save(string(base_dir, "/data/state_enhanced_reduced_jl.csv"), state_enhanced_reduced)
GitHub Links
- https://github.com/problemxsolutions/covid19/blob/master/scripts/r/covid_data_processing.R
- https://github.com/problemxsolutions/covid19/blob/master/scripts/python/covid_data_processing.py
- https://github.com/problemxsolutions/covid19/blob/master/scripts/julia/covid_data_processing.jl