Converting R scripts to Julia (Part 2)

As part of my Getting COVID-19 Data posts in R, Python and Julia, I will now advance to part two of the conversion process. As we saw in Part 1 of this post series, we duplicated the R scripts into the language specific script folder and changed the file extensions to the appropriate language. In this post I will demonstrate converting the script from R to Julia. I will describe and compare the functions along the way so if you are coming from the Julia side, you can relate to converting the opposite way, just as the R users can understand the conversion from their perspective.

Packages and Libraries

Packages and libraries provide additional modular functionality on top of each programming language’s foundation. R uses the function library() to load a package, while Julia uses the using or import keywords.

R

For this specific script, I’m loading tidyverse, magrittr and lubridate. Later on I also use a rolling-mean function from the zoo package. I tend to load tidyverse and magrittr in my scripts because they bring in a bunch of functional packages. I could easily load just dplyr and readr for this script, but tidyverse makes it more convenient to load multiple useful packages at once. magrittr provides pipe support within the tidyverse ecosystem of packages and functions. lubridate provides a wide range of functions to work with and parse date/time fields.

library(tidyverse)
library(magrittr)
library(lubridate)

R documentation on library (3.6)

Julia

For the Julia version of the script, I load the different packages with the using keyword, which operates similarly to the library function. In Julia, packages are more modular than in either R or Python. Alternatively I could use the import keyword, which operates like the Python import keyword and requires dot notation to access the module’s functions.

using CSVFiles # v0.16.1
using DataFrames # v0.21.8
using Dates
using Chain  # v0.4.2

Julia documentation on using, import (Julia v1.0+)
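To make the difference between the two keywords concrete, here is a minimal sketch using the Dates standard library that the script already loads. The dates are arbitrary illustration values, not taken from the data.

```julia
# `using` brings the module's exported names into scope.
using Dates
d = Date(2020, 3, 1)          # exported name, no prefix needed

# `import` binds only the module name, so calls need dot notation.
import Dates
d2 = Dates.Date(2020, 3, 1)

# `using Module: name` brings in only the names that are listed.
using Dates: dayname
weekday = dayname(d)
```

Both forms reach the same function. The choice is mostly about how much you want in your namespace.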

Punctuation

Before getting into the code, I want to briefly cover punctuation usages in both languages.

R

In R, you will notice the use of <-. This is the assignment operator. It assigns whatever is on the right of the punctuation to the object on the left. You could also use the function assign, which takes arguments for the variable name and the value. You can also use the equals sign, =, in place of the <-.

The next common punctuation you will see in my scripts is %>%, which is a forward pipe. It takes the object on the left and pipes it into (feeds it to) the function on the right. In function notation, x %>% f is equivalent to f(x). The beauty of pipes is that we can continue to build a longer pipe of processes feeding inputs to more and more functions. You will see how I use it in my code below as I chain together multiple data processing and manipulation steps.

The next is the combination of the pipe and the assignment punctuation, %<>%. What this does is it pipes the variable input on the left to the function on the right then assigns the output back to the input variable.

variable_a <- 4 # assign 4 to variable_a
variable_a = 4 # assign 4 to variable_a
# assign("variable_a", 4) assign 4 to variable_a
# all three methods are equivalent

variable_a %>% sqrt # pipe variable_a to the function sqrt (square root)

variable_a %<>% sqrt # pipe variable_a to the function sqrt (square root) then assign the result back to the input variable_a

R documentation on magrittr (2.01.9000).

Julia

In Julia the assignment operator is just the equals sign, =. You may see the exclamation point ! following a function name (e.g. select!()), which by convention signals that the operation modifies the object in place rather than relying on an explicit variable = assignment. You can use the |> punctuation as a forward pipe. Julia also uses a dot following a function name (e.g. f.()) to perform element-wise operations (called broadcasting).

When looking at Julia code it will look almost like a hybrid between R and Python, which is somewhat by design.

base_url = "https://api.covidtracking.com" # assignment

select!(df, ([:State, :Pop, :density]))

load(string(base_dir, filename)) |> DataFrame # pipe

strip.(df.state) # broadcast strip function on state field in DataFrame df to remove whitespace

Julia documentation on punctuation, piping and broadcasting (Julia v1.0+)
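The snippets above depend on the script's data, so here is a self-contained sketch of the same punctuation using only base Julia. The variable names and values are made up for illustration.

```julia
nums = [16.0, 4.0, 9.0]

# Broadcasting: the dot applies sqrt to each element.
roots = sqrt.(nums)

# Forward pipe: feed the value on the left into the function on the right.
total = nums |> sum

# Pipes can be chained; an anonymous function supplies extra arguments.
rounded = nums |> sum |> (x -> round(x / 3; digits = 1))

# Bang convention: sort returns a sorted copy, sort! changes nums itself.
sorted_copy = sort(nums)
sort!(nums)

# Broadcasting a string function, in the spirit of strip.(df.state).
names_clean = strip.([" NY", "CA ", " TX "])
```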

Downloading Files

In each language there are several ways to get data from a URL. For each language I’ll demonstrate the method I use and show an alternative as well.

R

In R, I prepared the process by assigning components of the URLs to variables. I did the same for the filename as well. Since there are multiple data sources at the base URL, I specify each unique ending separately. I only use one of them in my script though. The others are for the reader.

I used the download.file function to download the desired data file to my local computer. I prefer to have the source file locally rather than constantly downloading the same information while developing code. Alternatively you could read the data file at the source URL straight into a variable and skip the saving of the original file locally. This makes sense in an operational environment to be able to pull directly from the source whenever the script is run.

base_url <- 'https://api.covidtracking.com'

all_states <- '/v1/states/daily.csv'
state_specific_api <- '/v1/states/{state}/daily.csv'

current_states <- '/v1/states/current.csv'
current_state <- '/v1/states/{state}/current.csv'

filename <- paste0('./data/all_states_daily_covid.csv')
download.file(url = paste0(base_url, all_states), 
              destfile = filename)

# Alternative method to read the data straight to a variable
alt_method_data <- read_csv(file = paste0(base_url, all_states))

R documentation on download.file (base utility functionality) and read_csv (readr v1.6)

Julia

In Julia, I used the download function from the base package. This function works like the version in R where you provide the URL and file destination to save the data.

Alternatively you could use the combination of HTTP.get and CSV.Rows to read the file directly from the URL, as was explained in the R section above. Piping the output to DataFrame converts the result into the same object type used throughout the remainder of the script.

base_url = "https://api.covidtracking.com"

all_states = "/v1/states/daily.csv"
state_specific_api = "/v1/states/{state}/daily.csv"

current_states = "/v1/states/current.csv"
current_state = "/v1/states/{state}/current.csv"

base_dir = "/home/linux/ProblemXSolutions.com/DataProjects/covid19"
filename = "/data/all_states_daily_covid_jl.csv"

url = string(base_url, all_states)
download(url, string(base_dir, filename));

# Alternative method to read the data straight to a variable
import HTTP, CSV
url = string(base_url, all_states)
df = HTTP.get(url).body |> CSV.Rows |> DataFrame

Julia documentation on download (base), HTTP.get (HTTP package) and CSV.Rows (CSV package)

Reading/Writing CSV Files

The ability to read and write data is a very common and important task throughout the analytics/data science processes. In the subsections below I will show how to perform the operations for tabular data.

R

To read and write data to a csv file, we can use the readr package in the tidyverse package ecosystem. The functions are easy to use and can be lazily applied (i.e. with minimal arguments specified). The following snippets show both operations as used in my script.

In the first line, I read in data with the read_csv function. I just need to reference the file location. I could define a lot more like delimiters, column headings, datatypes, skip lines, missing values and reading in a subset of the data.

In the second function, write_csv, I can save a data object to the designated path.

state_pop_data <- read_csv(file = './data/state_populations.csv')

write_csv(file = './data/state_data_enhanced.csv', 
          x = state_data_enhanced)

R documentation on read_csv and write_csv can be found in the readr package. Good cheatsheet.

Julia

The following are the 1:1 equivalents to the R versions and function in about the same way. Both the load and save functions are in the CSVFiles package and operate on tabular data. Each function also has about the same optional arguments to tailor how the data is read in and written out.

state_pop_data = load(string(base_dir,"/data/state_populations.csv")) |> DataFrame

save(string(base_dir, "/data/state_data_enhanced_jl.csv"), state_data_enhanced)

# Alternative method using CSV package instead of CSVFiles
import CSV
CSV.write(string(base_dir, "/data/state_enhanced_reduced_jl_mod1.csv"),
     state_enhanced_reduced);

Julia documentation on load and save can be found in the CSVFiles package. Alternative CSV package.

Exploring the Data

Although I don’t show the data exploration process in the scripts, I’ll briefly describe some of the basic commands that you can run.

R

In the first line I get a view of the first few lines of the data.frame/tibble by default, though I can specify a number to return that many rows. The functions head() and tail() operate the same way, with one providing a view of the top and the other of the bottom of the data.frame/tibble. The next line provides a summary of the structure of the object using the str() function. The final line will generate summary statistics of each column or provide basic info about non-numeric columns.

data %>% head() # or tail()
data %>% str()
data %>% summary()

R documentation on head()/tail(), str() and summary() are part of the base and utility functions.

Julia

In the first line I get a view of the first line of the DataFrame by default, though I can specify a number to return that many rows. The functions first and last operate similarly. Without specifying a number, the response will only return 1 row. You need to specify the number of rows following the name of the DataFrame. These functions do not support plain piping when you want more than 1 row, since |> passes only a single argument. The functions head() and tail() are available, but there is a message regarding their deprecation.

The next line provides a summary of the structure providing only the number of rows and columns in the DataFrame. The final line will generate a summary of the data, providing basic statistics info of each column or provide basic info about non numeric columns.

data |> first # will only print first row.  head() is deprecated
first(data, 5) # will print first 5 rows.
data |> last # will only print last row.  tail() is deprecated
last(data, 5) # will print last 5 rows.

data |> summary # just gives table dimensions (row, columns)
data |> describe

Julia documentation on first()/last() and describe() can be found in the DataFrames package.
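As a small sketch of these exploration commands on a toy table (the values are made up, and DataFrames is assumed to be installed), including one way around the piping limitation by wrapping first in an anonymous function:

```julia
using DataFrames

data = DataFrame(state = ["OH", "TX", "NY", "CA"],
                 positive = [10, 20, 30, 40])

top_one   = first(data)                      # a single row by default
top_three = first(data, 3)                   # first 3 rows
bottom    = last(data, 2)                    # last 2 rows

# A plain pipe passes one argument; an anonymous function adds the row count.
top_piped = data |> (d -> first(d, 3))

dims  = size(data)                           # (rows, columns)
stats = describe(data)                       # one summary row per column
```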

Piping Operations

As mentioned above, the use of pipes to direct inputs and outputs is a style preference. In R, I use pipes and place each data operation on its own line. It makes the code clean and allows me to chain operations into sets of operations. I have operated in both styles, with the same end results.

R

As mentioned in the punctuation section, in R we can use the %>% to denote a forward piping. The value or object on the left gets piped into the function on the right of the punctuation. If we want to assign the final output back to the original input we can use %<>% to accomplish that. A lot of times, I will test code with the standard pipe and when I’m satisfied with the end result, I modify the first pipe to %<>%.

In the code snippet below, you can see each use of the forward pipe, %>%. It’s simple and clean.

state_pop_reduced <- 
   state_pop_data %>% 
   select(State, Pop, density) %>% 
   left_join(x = ., 
             y = state_name_lookup_data, 
             by = c('State' = 'state')) %>% 
   select(-1) %>% 
   rename(state = state_abr)

R documentation on magrittr (2.01.9000).

Julia

In Julia there are a couple of different options when it comes to piping operations. As mentioned in the punctuation section above, we can use the forward pipe, |>. There is a particularly handy package called Chain that provides a clean approach, especially when operating on DataFrames.

The following code in Julia accomplishes the same set of operations as above. After @chain we state the input DataFrame, then each line between the begin and end keywords denotes a successive operation to perform. It looks very similar to R, minus the forward pipe punctuation.

state_pop_reduced =
     @chain state_pop_reduced begin
         leftjoin(
             state_name_lookup_data,
             on =:state)
         select(Not([:state]))
         rename(Dict(:state_abr => :state))
     end

For a solid explanation on piping and variations in Julia, check out this link. Julia documentation for the Chain package.

Select Statements

Being able to select or drop columns from a table structure is quite handy in sub-setting data. The following examples will demonstrate each operation.

R

The following code snippet shows selecting multiple columns by name to subset my piped in data. In the second part, I demonstrate dropping a column using a column index rather than by name. In this instance the resultant data.frame would only have 2 columns, “Pop” and “density”.

new_df <- 
   state_pop_data %>% 
   select(State, Pop, density)

# Drop a column using column index
new_df %>% 
   select(-1)

R Documentation on select function from the dplyr package.

Julia

The following Julia code conducts the same select operation as above. The desired columns are called by name or index number in the list. Julia uses a different syntax for referring to column names, with a colon placed in front of the column name, :columnname. To drop one or more columns, we can use the Not() function and specify which columns should not be selected, i.e. dropped.

select(state_pop_data, ([:State, :Pop, :density]))

# Commits the select statement on the DataFrame provided, in-place
select!(state_pop_data, ([:State, :Pop, :density]))


# Drop column(s)
select(state_pop_reduced, Not([:state]))

# Commits Column Drop in-place
select!(state_pop_reduced, Not([:state]))

Julia documentation on select/select!, drop in the DataFrames package
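The snippet above selects by name only, so here is a minimal sketch that also shows selection and dropping by index, assuming DataFrames is installed. The table is a made-up stand-in for the population data.

```julia
using DataFrames

df = DataFrame(State = ["Ohio", "Texas"], Pop = [100, 200], density = [1.5, 2.5])

by_name  = select(df, [:State, :Pop])   # keep columns by name
by_index = select(df, [1, 2])           # keep columns by position
dropped  = select(df, Not(:State))      # drop a column by name
dropped1 = select(df, Not(1))           # drop by position, like select(-1) in R

select!(df, Not(:density))              # in place: df itself loses the column
```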

DataFrame Joins

Joining tables and DataFrames is a powerful operation. In order to join or merge these data objects, we need to have a common key in each beforehand. In the examples below, I have used a left join operation to merge the data from my state_name_lookup_data table (right or y) to my state_pop_data table (left or x). This operation keeps all the records from the left table and joins the table on the right only on the values that can be matched. I will cover the different join operations in a separate post.

R

After piping the state_pop_data through the select statement, it then flows into the left_join function. I join the two tables on their common fields, which is “State” on the left and “state” on the right. Had the column names been exactly the same I would just need to reference one. I could rename either of the columns to conform to the other as well. Unlike Julia, the join column on the right is not kept following the join. Following the join operation, I decide to drop the first column since it was not needed going forward. I also modify one of the new column names to take on the name of the field I just dropped.

state_pop_reduced <- 
   state_pop_data %>% 
   select(State, Pop, density) %>% 
   left_join(x = ., 
             y = state_name_lookup_data, 
             by = c('State' = 'state')) %>% 
   select(-1) %>% 
   rename(state = state_abr)

R documentation on x_join (dplyr).

Julia

The same comments apply as in the above R section. Julia has the same joins that R does and they are structured similarly. In the code below, I had actually changed the column names to be the same, so it looks different. If the column names differed, we would use :col_df1 => :col_df2 in place of the single column referenced below. To demonstrate the same tail end operations as above, you can see the dropping of the unneeded columns and renaming of a column.

state_pop_reduced =
     @chain state_pop_reduced begin
         leftjoin(
             state_name_lookup_data,
             on =:state)
         select(Not([:state]))
         rename(Dict(:state_abr => :state))
     end

Julia documentation on join (DataFrames).
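Since the script version joins on identically named columns, here is a small sketch of the :col_df1 => :col_df2 form with differently named keys. The two tables are made-up stand-ins, and DataFrames is assumed to be installed.

```julia
using DataFrames

# Toy stand-ins for the population and lookup tables (values are made up).
pop    = DataFrame(State = ["Ohio", "Texas", "Guam"], Pop = [100, 200, 30])
lookup = DataFrame(state = ["Ohio", "Texas"], state_abr = ["OH", "TX"])

# Left join on differently named key columns: left name => right name.
joined = leftjoin(pop, lookup, on = :State => :state)

# Rows from the left table without a match get `missing` in the new column.
unmatched = count(ismissing, joined.state_abr)
```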

Renaming Columns

Renaming an existing column is easy in both languages. The function name is the same in both as well. There are a number of different ways you can accomplish the task. I am showing one way of doing it, the way I used in my script.

R

In R we can rename any column in the pattern of “new_column_name” = “old_column_name”.

state_pop_reduced <- 
   state_pop_data %>% 
   select(State, Pop, density) %>% 
   left_join(x = ., 
             y = state_name_lookup_data, 
             by = c('State' = 'state')) %>% 
   select(-1) %>% 
   rename(state = state_abr)

R documentation on rename (dplyr).

Julia

In Julia we can rename any column in the pattern of :old_column_name => :new_column_name. The exclamation point in rename! determines whether the change is made in place or not.

state_pop_reduced =
     @chain state_pop_reduced begin
         leftjoin(
             state_name_lookup_data,
             on =:state)
         select(Not([:state]))
         rename(Dict(:state_abr => :state))
     end

# Alternatively, in-place name change
rename!(df, Dict(:old_colname => :new_colname))

Julia documentation on rename/ rename! (DataFrames).

Sorting/Ordering DataFrames

The next operation is sorting your DataFrame. In SQL you will be used to the Order By command to sort data in ascending or descending order by column. You can perform the same tasks in R and Julia as shown below.

R

In R, we can sort our data.frame using the arrange function. Columns control the sorting in the order they appear in the function arguments. If you want a column to sort in descending order you need to place a desc() function around the column name, otherwise it sorts in ascending order, just like SQL.

state_data %>% 
   arrange(state, date)

R documentation on arrange (dplyr).

Julia

In Julia, we use the sort function. We specify the columns in the order in which they should control the sorting. To control the sort direction of an individual column, wrap that column in the order function and provide a boolean value to its rev (reverse) argument. When you instead pass a list of booleans to the rev argument of sort itself, the size of the list should match the number of columns in the sort list.

sort(state_data_enhanced, [:state,:date])
sort(state_data_enhanced, [:state,order(:date, rev=true)])

# Commit In-place
sort!(state_data_enhanced, [:state,:date])

Julia documentation on sort/sort!, order (DataFrames).
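A short sketch of both ways to control sort direction, on a made-up table with integer dates for brevity (DataFrames assumed installed):

```julia
using DataFrames

df = DataFrame(state = ["TX", "OH", "TX", "OH"],
               date  = [20200301, 20200302, 20200302, 20200301])

# Ascending by state, then descending by date within each state.
newest_first = sort(df, [:state, order(:date, rev = true)])

# The same directions via the rev keyword, one flag per sort column.
same_result = sort(df, [:state, :date], rev = [false, true])

# sort! reorders df itself instead of returning a copy.
sort!(df, [:state, :date])
```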

Creating New Columns with Column Calculations

Enhancing our data is an important task. If the original data structure does not have a field or has multiple fields that could be reduced to one, we can accomplish this by creating or adding structure from calculations or logic statements. The join operation is a separate utility to enhance data as well, but the following operations allow you to create what you need from what you already have.

R

In R we can use the mutate command to add structure through calculations or logic statements against the existing structure. My example script uses it in multiple instances. In the first instance of mutate I calculate the difference between row values. In the second instance I create additional fields based on calculations to determine the adjusted variations of the input fields, as well as 7-day moving averages. The third instance, uses newly created fields from the second instance to create additional adjusted valued fields.

The reason for separating the instances is two-fold. First, it organizes the code. Second, if I wish to perform an operation on a field I just created, I need to complete the operation that creates it before having access to it; otherwise I need to perform the same field calculation twice.

... %>% 
   mutate(daily_recover = recovered - lag(recovered, default = first(recovered))) %>%
   mutate(daily_cases_adj = (positiveIncrease / Pop) * 100000,
          daily_recover_adj = (daily_recover / Pop) * 100000,
          daily_deaths_adj = (deathIncrease / Pop) * 100000,
          active_roll7 = zoo::rollmean(positiveIncrease, k = 7, fill = NA),
          recovered_roll7 = zoo::rollmean(daily_recover, k = 7, fill = NA),
          deaths_roll7 = zoo::rollmean(deathIncrease, k = 7, fill = NA)) %>% 
   mutate(active_roll7_adj = zoo::rollmean(daily_cases_adj, k = 7, fill = NA),
          recovered_roll7_adj = zoo::rollmean(daily_recover_adj, k = 7, fill = NA),
          deaths_roll7_adj = zoo::rollmean(daily_deaths_adj, k = 7, fill = NA))

R documentation on mutate (dplyr).

Julia

In the Julia version, it looks almost exactly the same as R and less like Python. Using the transform function, I can perform the same tasks. Since I didn’t have readily available functions to calculate the difference between row values or the moving average, I created the functions myself, as well as the population adjustment one. The structure of applying the functions is [:col1,...] => (x, ... -> function(x,...)) => :new_colname. The (x, ... -> function(x,...)) part creates an anonymous function that gives you more control over arguments. You can see that portion in the post or script on GitHub.

As you can see I structured my transform statements in the same way I did the mutate statements in R.

state_data_enhanced =
     @chain state_data_enhanced begin
         groupby(:state)
         transform(:recovered => (x -> deltas(x, 1)) => :daily_recover)
         transform(
             [:positiveIncrease, :Pop] => ByRow(pop_adjusted) => :daily_cases_adj,
             [:daily_recover, :Pop] => ByRow(pop_adjusted) => :daily_recover_adj,
             [:deathIncrease, :Pop] => ByRow(pop_adjusted) => :daily_deaths_adj)
         transform(
             :positiveIncrease => (x -> moving_average(x, 7)) => :active_roll7,
             :daily_recover => (x -> moving_average(x, 7)) => :recovered_roll7,
             :deathIncrease => (x -> moving_average(x, 7)) => :deaths_roll7,
             :daily_cases_adj => (x -> moving_average(x, 7)) => :active_roll7_adj,
             :daily_recover_adj => (x -> moving_average(x, 7)) => :recovered_roll7_adj,
             :daily_deaths_adj => (x -> moving_average(x, 7)) => :deaths_roll7_adj)
     end

Julia documentation on transform (DataFrames).
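The helper functions deltas, moving_average and pop_adjusted are defined in the full script rather than shown here. As a rough idea of what such helpers could look like, here is a hypothetical sketch in base Julia. It is not the author's implementation: the bodies, the zero fill at the start of deltas, and the centered window (mirroring the default alignment of zoo::rollmean with fill = NA) are all assumptions, and it assumes the input contains no missing values.

```julia
# Hypothetical stand-ins; the real definitions live in the linked script.

# Difference between each value and the one `lag` rows earlier.
# The first `lag` entries have no predecessor and are set to zero.
function deltas(x::AbstractVector, lag::Integer)
    out = zeros(eltype(x), length(x))
    for i in (lag + 1):length(x)
        out[i] = x[i] - x[i - lag]
    end
    return out
end

# Centered moving average over an odd-sized window.
# Positions without a full window are left as `missing`.
function moving_average(x::AbstractVector, window::Integer)
    isodd(window) || throw(ArgumentError("window must be odd"))
    n = length(x)
    half = div(window, 2)
    out = Vector{Union{Missing, Float64}}(missing, n)
    for i in (half + 1):(n - half)
        out[i] = sum(x[(i - half):(i + half)]) / window
    end
    return out
end

# Rate per 100,000 people, matching (value / Pop) * 100000 in the R version.
pop_adjusted(value, pop) = (value / pop) * 100000

daily = deltas([1, 3, 6, 10], 1)
smooth = moving_average([1, 2, 3, 4, 5], 3)
rate = pop_adjusted(50, 1_000_000)
```

Because pop_adjusted works on single values, it is wrapped in ByRow in the transform call, while the other two take whole columns and are passed through anonymous functions.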

What’s Next?

In the next follow up post, I will begin to demonstrate creating basic data visualizations from the data produced as a result of the initial processing scripts in both R and Julia.

GitHub Link

For the full scripts for each language referenced here, refer to my GitHub repo and links below.

https://github.com/problemxsolutions/covid19/blob/master/scripts/r/covid_data_processing.R
https://github.com/problemxsolutions/covid19/blob/master/scripts/python/covid_data_processing.py
https://github.com/problemxsolutions/covid19/blob/master/scripts/julia/covid_data_processing.jl
