As part of my Getting COVID-19 Data posts in R, Python and Julia, I will now advance to part two of the conversion process. As we saw in Part 1 of this post series, we duplicated the R scripts into the language specific script folder and changed the file extensions to the appropriate language. In this post I will demonstrate converting the script from R to Julia. I will describe and compare the functions along the way so if you are coming from the Julia side, you can relate to converting the opposite way, just as the R users can understand the conversion from their perspective.
Packages and Libraries
Packages and libraries provide additional modular functionality on top of each programming language’s foundation. R uses the function library() to load a package, while Julia uses the using or import keywords.
R
For this specific script, I’m loading tidyverse, magrittr and lubridate. Later on I also use a function in the package roll. I tend to load tidyverse and magrittr in my scripts because they load a bunch of functional packages. I could easily load just dplyr and readr for this script, but tidyverse offers the convenience of loading multiple useful packages at once. magrittr provides pipe support within the tidyverse ecosystem of packages and functions. lubridate provides a wide range of functions to work with and parse date/time fields.
library(tidyverse)
library(magrittr)
library(lubridate)
R documentation on library (3.6)
Julia
For the Julia version of the script, I load the different packages with the using keyword, which operates similarly to the library function. In Julia, packages are more modular than in either R or Python. Alternatively I could use the import keyword, which operates like the Python import keyword and requires dot notation to access the module’s functions.
using CSVFiles # v0.16.1
using DataFrames # v0.21.8
using Dates
using Chain # v0.4.2
Julia documentation on using, import (Julia v1.0+)
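To make the difference between the two keywords concrete, here is a minimal sketch using the Dates standard library (chosen only because it ships with Julia; the date value is arbitrary). With import, only the module name is in scope, so every function needs dot notation. With using, the exported names can be called directly.

```julia
import Dates                      # only the module name Dates is brought into scope
d1 = Dates.Date(2020, 11, 3)      # dot notation is required
y1 = Dates.year(d1)

using Dates                       # exported names such as Date and year are now in scope
d2 = Date(2020, 11, 3)            # no module prefix needed
y2 = year(d2)
```

Both styles produce the same objects; the choice mostly affects how explicit the code is about where a function comes from.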
Punctuation
Before getting into the code, I want to briefly cover punctuation usages in both languages.
R
In R, you will notice the use of <-, the assignment operator. It assigns whatever is on the right of the operator to the object on the left. You could also use the function assign, which takes arguments for the variable name and the value. You can also use the equals sign, =, in place of <-.
The next common punctuation you will see in my scripts is %>%, which is a forward pipe. It takes the object on the left and pipes it (feeds it) into a function on the right. In function notation, x %>% f is equivalent to f(x). The beauty of pipes is that we can continue to build a longer pipe of processes feeding inputs to more and more functions. You will see how I use it in my code below as I chain together multiple data processing and manipulation steps.
The next is the combination of the pipe and the assignment punctuation, %<>%. It pipes the variable on the left to the function on the right, then assigns the output back to the input variable.
variable_a <- 4 # assign 4 to variable_a
variable_a = 4 # assign 4 to variable_a
assign("variable_a", 4) # assign 4 to variable_a
# all three methods are equivalent
variable_a %>% sqrt # pipe variable_a to the function sqrt (square root)
variable_a %<>% sqrt # pipe variable_a to the function sqrt (square root) then assign the result back to the input variable_a
R documentation on magrittr (2.0.1.9000).
Julia
In Julia the assignment operator is just the equals sign, =. You may see the exclamation point ! following a function name (e.g. select!()), which by convention means the function modifies its argument in place rather than relying on an explicit variable = assignment. You can use the |> punctuation as a forward pipe. Julia also uses a dot following a function name (e.g. f.()) to perform element-wise operations (called broadcasting).
Julia code can look almost like a hybrid between R and Python, which is somewhat by design.
base_url = "https://api.covidtracking.com" # assignment
select!(df, [:State, :Pop, :density]) # in-place operation: df itself is modified
load(string(base_dir, filename)) |> DataFrame # pipe
strip.(df.state) # broadcast strip function on state field in DataFrame df to remove whitespace
Julia documentation on punctuation, piping and broadcasting (Julia v1.0+)
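The three pieces of punctuation described above can be tried without any packages. The following is a small sketch in base Julia that mirrors the R snippet from the previous section; the variable names and sample values are made up for illustration.

```julia
variable_a = 4
root = variable_a |> sqrt            # pipe: same as sqrt(variable_a)
variable_a = variable_a |> sqrt      # closest analogue of R's %<>%: pipe, then reassign

states = [" NY ", "CA  "]
cleaned = strip.(states)             # broadcasting: strip is applied to each element

counts = [3, 1, 2]
sorted_copy = sort(counts)           # returns a new vector; counts is unchanged here
sort!(counts)                        # the ! version sorts counts in place
```

Julia has no built-in operator equivalent to %<>%, so the reassignment is written out explicitly or replaced by a ! function where one exists.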
Downloading Files
In each language there are several ways to get data from a URL. For each language I’ll demonstrate the method I use and show an alternative as well.
R
In R, I prepared the process by assigning the components of the URLs to variables. I did the same for the filename. Since there are multiple data sources at the base URL, I specify each unique ending separately. I only use one of them in my script; the others are there for the reader.
I used the download.file function to download the desired data file to my local computer. I prefer to have the source file locally rather than repeatedly downloading the same information while developing code. Alternatively you could read the data file at the source URL straight into a variable and skip saving the original file locally. This makes sense in an operational environment, where you want to pull directly from the source whenever the script is run.
base_url <- 'https://api.covidtracking.com'
all_states <- '/v1/states/daily.csv'
state_specific_api <- '/v1/states/{state}/daily.csv'
current_states <- '/v1/states/current.csv'
current_state <- '/v1/states/{state}/current.csv'
filename <- paste0('./data/all_states_daily_covid.csv')
download.file(url = paste0(base_url, all_states),
destfile = filename)
# Alternative method to read the data straight to a variable
alt_method_data <- read_csv(file = paste0(base_url, all_states))
R documentation on download.file (base utility functionality) and read_csv (readr v1.6)
Julia
In Julia, I used the download function from Base. This function works like the version in R: you provide the URL and the file destination to save the data.
Alternatively you could use the combination of HTTP.get and CSV.Rows to read the file directly from the URL, as described in the R section above. Piping the output to DataFrame converts the result to the object structure used throughout the remainder of the script.
base_url = "https://api.covidtracking.com"
all_states = "/v1/states/daily.csv"
state_specific_api = "/v1/states/{state}/daily.csv"
current_states = "/v1/states/current.csv"
current_state = "/v1/states/{state}/current.csv"
base_dir = "/home/linux/ProblemXSolutions.com/DataProjects/covid19"
filename = "/data/all_states_daily_covid_jl.csv"
url = string(base_url, all_states)
download(url, string(base_dir, filename));
# Alternative method to read the data straight to a variable
import HTTP, CSV
url = string(base_url, all_states)
df = HTTP.get(url).body |> CSV.Rows |> DataFrame
Julia documentation on download (base), HTTP.get (HTTP package) and CSV.Rows (CSV package)
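The state-specific endings contain a {state} placeholder that the script itself never fills in. As a sketch of how a reader might use them, the hypothetical helper below (state_url is not part of the original script) substitutes a state code with replace and prepends the base URL. The lowercase code "ny" is an assumption about what the API expects.

```julia
base_url = "https://api.covidtracking.com"
state_specific_api = "/v1/states/{state}/daily.csv"

# Hypothetical helper: fill the {state} placeholder and prepend the base URL
state_url(state) = string(base_url, replace(state_specific_api, "{state}" => state))

ny_url = state_url("ny")
# The resulting string could then be passed to download(ny_url, destination_path)
```

No network request is made here; the snippet only builds the URL string.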
Reading/Writing CSV Files
The ability to read and write data is a very common and important task throughout the analytics/data science processes. In the subsections below I will show how to perform the operations for tabular data.
R
To read and write data to a csv file, we can use the readr package in the tidyverse ecosystem. The functions are easy to use and can be lazily applied (i.e. with minimal arguments specified). The following snippets show both operations as used in my script.
In the first line, I read in data with the read_csv function. I just need to reference the file location. I could define a lot more, like delimiters, column headings, datatypes, lines to skip, missing values and reading in a subset of the data.
With the second function, write_csv, I can save a data object to the designated path.
state_pop_data <- read_csv(file = './data/state_populations.csv')
write_csv(file = './data/state_data_enhanced.csv',
x = state_data_enhanced)
R documentation on read_csv and write_csv can be found in the readr package. Good cheatsheet.
Julia
The following are the 1:1 equivalents to the R versions and function in about the same way. Both the load and save functions are in the CSVFiles package and operate on tabular data. Each function also has about the same optional arguments to tailor how to read in the data and how to write it out.
state_pop_data = load(string(base_dir,"/data/state_populations.csv")) |> DataFrame
save(string(base_dir, "/data/state_data_enhanced_jl.csv"), state_data_enhanced)
# Alternative method using CSV package instead of CSVFiles
import CSV
CSV.write(string(base_dir, "/data/state_enhanced_reduced_jl_mod1.csv"),
state_enhanced_reduced);
Julia documentation on load and save can be found in the CSVFiles package. Alternative CSV package.
Exploring the Data
Although I don’t show the data exploration process in the scripts, I’ll briefly describe some of the basic commands that you can run.
R
In the first line I get a view of the first few rows of the data.frame/tibble (six by default), though I can specify a number to return that many rows. The functions head() and tail() operate the same way, with one providing a view of the top and the other of the bottom of the data.frame/tibble. The next line provides a summary of the structure of the object using the str() function. The final line generates summary statistics for each column, or basic info for non-numeric columns.
data %>% head() # or tail()
data %>% str()
data %>% summary()
R documentation on head()/tail(), str() and summary() is part of the base and utility functions.
Julia
In the first line I get a view of the first row of the DataFrame by default, though I can specify a number to return that many rows. The functions first and last operate similarly. Without specifying a number, the response will only return 1 row. To get more, you specify the number of rows following the name of the DataFrame. Because of that second argument, these functions cannot be used with a bare pipe when you want more than 1 row. The functions head() and tail() are available, but they print a message regarding their deprecation.
The next line provides a summary of the structure, giving only the number of rows and columns in the DataFrame. The final line generates a summary of the data, providing basic statistics for each column or basic info for non-numeric columns.
data |> first # will only print first row. head() is deprecated
first(data, 5) # will print first 5 rows.
data |> last # will only print the last row. tail() is deprecated
last(data, 5) # will print last 5 rows.
data |> summary # just gives table dimensions (row, columns)
data |> describe
Julia documentation on first()/last() and describe() can be found in the DataFrames package.
Piping Operations
As mentioned above, the use of pipes to direct inputs and outputs is a style preference. In R, I use pipes and place each data operation on its own line. It makes the code clean and allows me to chain operations into sets of operations. I have operated in both styles, with the same end results.
R
As mentioned in the punctuation section, in R we can use %>% to denote a forward pipe. The value or object on the left gets piped into the function on the right of the punctuation. If we want to assign the final output back to the original input we can use %<>% to accomplish that. A lot of times, I will test code with the standard pipe and, when I’m satisfied with the end result, modify the first pipe to %<>%.
In the code snippet below, you can see each use of the forward pipe, %>%. It’s simple and clean.
state_pop_reduced <-
state_pop_data %>%
select(State, Pop, density) %>%
left_join(x = .,
y = state_name_lookup_data,
by = c('State' = 'state')) %>%
select(-1) %>%
rename(state = state_abr)
R documentation on magrittr (2.0.1.9000).
Julia
In Julia there are a couple of different options when it comes to piping operations. As mentioned in the punctuation section above, we can use the forward pipe |>. There is a particularly handy package called Chain that provides a clean approach, especially when operating on DataFrames.
The following code in Julia accomplishes the same set of operations as above. After @chain we state the input DataFrame, then everything between the begin and end keywords denotes each successive operation to perform. It looks very similar to R, minus the forward pipe punctuation.
state_pop_reduced =
@chain state_pop_reduced begin
leftjoin(
state_name_lookup_data,
on =:state)
select(Not([:state]))
rename(Dict(:state_abr => :state))
end
For a solid explanation on piping and variations in Julia, check out this link. Julia documentation for the Chain package.
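To see what @chain does without a DataFrame in the way, here is a minimal sketch on a plain vector, assuming the Chain package is installed. Each line receives the previous result as its first argument, unless an underscore marks a different position. The input values are arbitrary.

```julia
using Chain

# Each step receives the result of the previous one
result = @chain [4, 9, 16] begin
    map(sqrt, _)   # underscore marks where the previous result goes
    sum            # a bare function name is called on the previous result
end

# The same computation written as nested calls
nested = sum(map(sqrt, [4, 9, 16]))
```

The chained form reads top to bottom in execution order, which is the same benefit the %>% pipe gives in R.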
Select Statements
Being able to select or drop columns from a table structure is quite handy in sub-setting data. The following examples will demonstrate each operation.
R
The following code snippet shows selecting multiple columns by name to subset my piped in data. In the second part, I demonstrate dropping a column using a column index rather than by name. In this instance the resultant data.frame would only have 2 columns, “Pop” and “density”.
new_df <-
state_pop_data %>%
select(State, Pop, density)
# Drop a column using column index
new_df %>%
select(-1)
R documentation on the select function from the dplyr package.
Julia
The following Julia code conducts the same select operation as above. The desired columns are called by name or index number in the list. Julia uses a different syntax for referring to column names, with a colon prepended to the column name, :columnname. To drop one or more columns, we can use the Not() function and specify which columns should not be selected, i.e. dropped.
select(state_pop_data, ([:State, :Pop, :density]))
# Commits the select statement on the DataFrame provided, in-place
select!(state_pop_data, ([:State, :Pop, :density]))
# Drop column(s)
select(state_pop_reduced, Not([:state]))
# Commits Column Drop in-place
select!(state_pop_reduced, Not([:state]))
Julia documentation on select/select!, drop in the DataFrames package
DataFrame Joins
Joining tables and DataFrames is a powerful operation. In order to join or merge these data objects, we need to have a common key in each beforehand. In the examples below, I have used a left join operation to merge the data from my state_name_lookup_data table (right or y) to my state_pop_data table (left or x). This operation keeps all the records from the left table and joins the columns of the right table only for the rows whose key values match. I will cover the different join operations in a separate post.
R
After piping state_pop_data through the select statement, it then flows into the left_join function. I join the two tables on their common fields, which is “State” on the left and “state” on the right. Had the column names been exactly the same I would just need to reference one. I could also rename either of the columns to conform to the other. Unlike Julia, the join column on the right is not kept following the join. Following the join operation, I decide to drop the first column since it was not needed going forward. I also rename one of the new columns to take on the name of the field I just dropped.
state_pop_reduced <-
state_pop_data %>%
select(State, Pop, density) %>%
left_join(x = .,
y = state_name_lookup_data,
by = c('State' = 'state')) %>%
select(-1) %>%
rename(state = state_abr)
R documentation on x_join (dplyr).
Julia
The same comments apply as in the above R section. Julia has the same joins that R does and they are structured similarly. In the code below, I had actually changed the column names to be the same, so it looks different. If the column names differed we would use :col_df1 => :col_df2 in place of the single column referenced below. To demonstrate the same tail end operations as above, you can see the dropping of the unneeded column and the renaming of a column.
state_pop_reduced =
@chain state_pop_reduced begin
leftjoin(
state_name_lookup_data,
on =:state)
select(Not([:state]))
rename(Dict(:state_abr => :state))
end
Julia documentation on join (DataFrames).
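The :col_df1 => :col_df2 form mentioned above can be sketched with two toy tables. The table names, column values and populations below are invented for illustration and merely stand in for state_pop_data and state_name_lookup_data; the snippet assumes the DataFrames package is installed.

```julia
using DataFrames

# Toy stand-ins for the population and lookup tables (values are made up)
pop = DataFrame(State = ["New York", "Texas"], Pop = [100, 200])
lookup = DataFrame(state = ["New York", "Texas"], state_abr = ["NY", "TX"])

# Key columns have different names: left column => right column
joined = leftjoin(pop, lookup, on = :State => :state)

# Drop the long state name and let the abbreviation take over the name "state"
reduced = rename(select(joined, Not(:State)), :state_abr => :state)
```

This mirrors the R version, where by = c('State' = 'state') plays the role of the on = :State => :state pair.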
Renaming Columns
Renaming an existing column is easy in both languages. The function name is the same in both as well. There are a number of different ways you can accomplish the task; I am showing the one I used in my script.
R
In R we can rename any column in the pattern of “new_column_name” = “old_column_name”.
state_pop_reduced <-
state_pop_data %>%
select(State, Pop, density) %>%
left_join(x = .,
y = state_name_lookup_data,
by = c('State' = 'state')) %>%
select(-1) %>%
rename(state = state_abr)
R documentation on rename (dplyr).
Julia
In Julia we can rename any column in the pattern of :old_column_name => :new_column_name. The exclamation point in rename! determines whether the change is made in place or not.
state_pop_reduced =
@chain state_pop_reduced begin
leftjoin(
state_name_lookup_data,
on =:state)
select(Not([:state]))
rename(Dict(:state_abr => :state))
end
# Alternatively, in-place name change
rename!(df, Dict(:old_colname => :new_colname))
Julia documentation on rename/rename! (DataFrames).
Sorting/Ordering DataFrames
The next operation is sorting your DataFrame. If you know SQL, you will be used to the ORDER BY clause for sorting data in ascending or descending order by column. You can perform the same tasks in R and Julia as shown below.
R
In R, we can sort our data.frame using the arrange function. The columns control the sort in the order they appear in the function arguments. If you want a column to sort in descending order you need to wrap the column name in the desc() function; otherwise it sorts in ascending order, just like SQL.
state_data %>%
arrange(state, date)
R documentation on arrange (dplyr).
Julia
In Julia, we use the sort function. We specify the columns that should control the sorting, in order of priority. To control the sort direction of an individual column, you wrap that column in the order function and provide a boolean value to its rev (reverse) argument. If you instead pass rev as a list directly to sort, the size of the list should match the number of columns being sorted on.
sort(state_data_enhanced, [:state,:date])
sort(state_data_enhanced, [:state,order(:date, rev=true)])
# Commit In-place
sort!(state_data_enhanced, [:state,:date])
Julia documentation on sort/sort!, order (DataFrames).
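A small sketch of the sorting variants above on a toy table may help show how the direction of a single column is controlled. The column values are invented, and the snippet assumes the DataFrames package is installed.

```julia
using DataFrames

# Toy data: two states, two dates each (values are made up)
df = DataFrame(state = ["NY", "CA", "NY", "CA"],
               date  = [2, 1, 1, 2],
               cases = [10, 20, 30, 40])

asc = sort(df, [:state, :date])                       # both columns ascending
mixed = sort(df, [:state, order(:date, rev = true)])  # state ascending, date descending

sort!(df, [:state, :date])                            # in place: df itself is reordered
```

Here asc and mixed are new DataFrames, while the final call changes df directly, just like the in-place versions of select and rename.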
Creating New Columns with Column Calculations
Enhancing our data is an important task. If the original data structure does not have a field or has multiple fields that could be reduced to one, we can accomplish this by creating or adding structure from calculations or logic statements. The join operation is a separate utility to enhance data as well, but the following operations allow you to create what you need from what you already have.
R
In R we can use the mutate command to add structure through calculations or logic statements against the existing structure. My example script uses it in multiple instances. In the first instance of mutate I calculate the difference between row values. In the second instance I create additional fields based on calculations to determine the adjusted variations of the input fields, as well as 7-day moving averages. The third instance uses newly created fields from the second instance to create additional adjusted-value fields.
The reason for separating the instances is two-fold. First, it organizes the code. Second, if I wish to perform an operation on a field I just created, I need to complete the operation that creates it before having access to it; otherwise I need to perform the same field calculation twice.
... %>%
mutate(daily_recover = recovered - lag(recovered, default = first(recovered))) %>%
mutate(daily_cases_adj = (positiveIncrease / Pop) * 100000,
daily_recover_adj = (daily_recover / Pop) * 100000,
daily_deaths_adj = (deathIncrease / Pop) * 100000,
active_roll7 = zoo::rollmean(positiveIncrease, k = 7, fill = NA),
recovered_roll7 = zoo::rollmean(daily_recover, k = 7, fill = NA),
deaths_roll7 = zoo::rollmean(deathIncrease, k = 7, fill = NA)) %>%
mutate(active_roll7_adj = zoo::rollmean(daily_cases_adj, k = 7, fill = NA),
recovered_roll7_adj = zoo::rollmean(daily_recover_adj, k = 7, fill = NA),
deaths_roll7_adj = zoo::rollmean(daily_deaths_adj, k = 7, fill = NA))
R documentation on mutate (dplyr).
Julia
The Julia version looks almost exactly the same as R and less like Python. Using the transform function, I can perform the same tasks. Since I didn’t have readily available functions to calculate the difference between row values or the moving average, I created those functions, as well as the population adjustment one. The structure of applying the functions is [:col1,...] => ((x, ...) -> function(x,...)) => :new_colname. The ((x, ...) -> function(x,...)) part creates an anonymous function that gives you more control over the arguments. You can see those functions in the post or script on GitHub.
As you can see, I structured my transform statements in the same way I did the mutate statements in R.
state_data_enhanced =
@chain state_data_enhanced begin
groupby(:state)
transform(:recovered => (x -> deltas(x, 1)) => :daily_recover,
ungroup = false) # transform ungroups by default; keep the grouping so later steps still run per state
transform(
[:positiveIncrease, :Pop] => ByRow(pop_adjusted) => :daily_cases_adj,
[:daily_recover, :Pop] => ByRow(pop_adjusted) => :daily_recover_adj,
[:deathIncrease, :Pop] => ByRow(pop_adjusted) => :daily_deaths_adj,
ungroup = false)
transform(
:positiveIncrease => (x -> moving_average(x, 7)) => :active_roll7,
:daily_recover => (x -> moving_average(x, 7)) => :recovered_roll7,
:deathIncrease => (x -> moving_average(x, 7)) => :deaths_roll7,
:daily_cases_adj => (x -> moving_average(x, 7)) => :active_roll7_adj,
:daily_recover_adj => (x -> moving_average(x, 7)) => :recovered_roll7_adj,
:daily_deaths_adj => (x -> moving_average(x, 7)) => :deaths_roll7_adj)
end
Julia documentation on transform (DataFrames).
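The helper functions deltas, moving_average and pop_adjusted are named above but their bodies are not shown in this post. The following is a minimal sketch of what such helpers might look like in base Julia. These are hypothetical stand-ins, not the versions in the GitHub script. In particular, the moving average here uses a trailing window and fills the first entries with missing, which is an assumption and differs from the centered window that zoo::rollmean uses by default in R.

```julia
# Hypothetical helpers: illustrative only, not the author's implementations

# Difference between each value and the one `lag` rows earlier.
# The first `lag` entries are 0, matching lag(x, default = first(x)) in R for lag = 1.
deltas(x::AbstractVector, lag::Integer) =
    [i <= lag ? zero(x[i]) : x[i] - x[i - lag] for i in 1:length(x)]

# Trailing moving average over the last n values; the first n - 1 entries are missing.
function moving_average(x::AbstractVector, n::Integer)
    out = Vector{Union{Missing, Float64}}(missing, length(x))
    for i in n:length(x)
        out[i] = sum(x[i-n+1:i]) / n
    end
    return out
end

# Rate per 100,000 people, as in the R mutate calls
pop_adjusted(count, pop) = (count / pop) * 100000

daily = deltas([1, 3, 6, 10], 1)
smooth = moving_average([1, 2, 3, 4, 5], 3)
rate = pop_adjusted(50, 1_000_000)
```

Because pop_adjusted works on single values, it is the one wrapped in ByRow in the transform calls, while deltas and moving_average receive a whole column at a time.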
What’s Next?
In the next follow up post, I will begin to demonstrate creating basic data visualizations from the data produced as a result of the initial processing scripts in both R and Julia.
GitHub Link
For the full scripts for each language referenced here, refer to my GitHub repo and links below.
- https://github.com/problemxsolutions/covid19/blob/master/scripts/r/covid_data_processing.R
- https://github.com/problemxsolutions/covid19/blob/master/scripts/python/covid_data_processing.py
- https://github.com/problemxsolutions/covid19/blob/master/scripts/julia/covid_data_processing.jl