As part of my Getting COVID-19 Data series of posts in R, Python and Julia, I will now advance to part two of the conversion process. As we saw in Part 1 of this series, we duplicated the R scripts into the language-specific script folders and changed the file extensions to match each language. In this post I will demonstrate converting the script from R to Python. I will describe and compare the functions along the way, so if you are coming from the Python side you can relate to converting in the opposite direction, just as R users can understand the conversion from their perspective.
Packages and Libraries
Packages and libraries provide additional modular functionality on top of each programming language’s foundation. R uses the library() function to load a package, while Python uses the import statement.
R
For this specific script, I’m loading tidyverse, magrittr and lubridate; later on I also use a rolling-mean function from the zoo package. I tend to load tidyverse and magrittr in my scripts because they load a bunch of functional packages. I could easily load just dplyr and readr for this script, but tidyverse offers the convenience of loading multiple useful packages at once. magrittr provides pipe support within the tidyverse ecosystem of packages and functions. lubridate provides a wide range of functions for working with and parsing date/time fields.
library(tidyverse)
library(magrittr)
library(lubridate)
R documentation on library (3.6)
Python
For the Python version of the script, I load the urllib.request and pandas packages. pandas is on par with the standard data.frame object structure in R plus the functionality of the dplyr, readr, tibble, stringr and tidyr packages. A nice feature of Python is the ability to give a package an alias to shorten the way we refer to it.
import urllib.request
import pandas as pd
Python documentation on import (python3.8)
Punctuation
Before getting into the code, I want to briefly cover punctuation usages in both languages.
R
In R, you will notice the use of <-. This is the assignment operator: it assigns whatever is on the right of the punctuation to the object on the left. You could also use the assign() function, which takes arguments for the variable name and the value, or the equals sign, =, in place of <-.
The next common punctuation you will see in my scripts is %>%, the forward pipe. It takes the object on the left and pipes it into (feeds it to) the function on the right; x %>% f is equivalent to the function notation f(x). The beauty of pipes is that we can keep building a longer chain of processes, feeding outputs into more and more functions. You will see how I use them in my code below as I chain together multiple data processing and manipulation steps.
The next is the combination of the pipe and the assignment punctuation, %<>%. It pipes the variable on the left into the function on the right, then assigns the output back to that same variable.
variable_a <- 4 # assign 4 to variable_a
variable_a = 4 # assign 4 to variable_a
# assign("variable_a", 4) assign 4 to variable_a
# all three methods are equivalent
variable_a %>% sqrt # pipe variable_a to the function sqrt (square root)
variable_a %<>% sqrt # pipe variable_a to the function sqrt (square root) then assign the result back to the input variable_a
R documentation on magrittr (2.0.1.9000).
Python
In Python the assignment operator is just the equals sign, =. I have not seen piping punctuation readily available in Python. The closest I came to piping was to place a line break before each .function call.
In the code below you can see that I broke the line before .sort_values and .assign. It does help break up the code, but it is not as clean as what you will see in R or Julia. As a result, you will often see more lines of code here than in the corresponding converted scripts.
state_data_enhanced = \
state_data_enhanced \
.sort_values(by=['state', 'date']) \
.assign(
daily_recover = state_data_enhanced.groupby('state')['recovered'].transform('diff')
) \
.assign(
daily_cases_adj = lambda x: (x['positiveIncrease'] / x['Pop']) * 100000,
daily_recover_adj = lambda x: (x['daily_recover'] / x['Pop']) * 100000,
daily_deaths_adj = lambda x: (x['deathIncrease'] / x['Pop']) * 100000
)
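As an aside, pandas does offer a built-in pipe method on DataFrames that passes the frame into an arbitrary function, which is the closest analogue to R’s forward pipe that I’m aware of. A minimal sketch, using a hypothetical per_capita helper of my own; wrapping the chain in parentheses also avoids the backslash continuations used above:
# hypothetical helper: adds a population-adjusted version of a column
def per_capita(df, col, pop_col='Pop'):
    return df.assign(**{col + '_adj': (df[col] / df[pop_col]) * 100000})

state_data_enhanced = (
    state_data_enhanced
    .sort_values(by=['state', 'date'])
    .pipe(per_capita, 'positiveIncrease')  # DataFrame flows in as the first argument
)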
Downloading Files
In each language there are several ways to get data from a URL. For each language I’ll demonstrate the method I use and show an alternative as well.
R
In R, I prepared the process by assigning components of the URLs to variables. I did the same for the filename. Since there are multiple data sources at the base URL, I specify each unique ending separately. I only use one of them in this script; the others are there for the reader.
I used the download.file function to download the desired data file to my local computer. I prefer to have the source file locally rather than repeatedly downloading the same information while developing code. Alternatively, you could read the data file at the source URL straight into a variable and skip saving the original file locally; this makes sense in an operational environment, where you want to pull directly from the source whenever the script is run.
base_url <- 'https://api.covidtracking.com'
all_states <- '/v1/states/daily.csv'
state_specific_api <- '/v1/states/{state}/daily.csv'
current_states <- '/v1/states/current.csv'
current_state <- '/v1/states/{state}/current.csv'
filename <- paste0('./data/all_states_daily_covid.csv')
download.file(url = paste0(base_url, all_states),
destfile = filename)
# Alternative method to read the data straight to a variable
alt_method_data <- read_csv(file = paste0(base_url, all_states))
R documentation on download.file (base utility functions) and read_csv (readr v1.6)
Python
In Python, I used the urllib.request package with its urlretrieve function. The official documentation indicates that this method may become deprecated in the future. The newer approach would be to use the urlopen, read() and pandas.to_csv functions.
Alternatively, you could use pd.read_csv to read the file directly from the URL, as explained in the R section above.
base_url = 'https://api.covidtracking.com'
all_states = '/v1/states/daily.csv'
current_states = '/v1/states/current.csv'
state_specific_api = '/v1/states/{state}/daily.csv'
current_state = '/v1/states/{state}/current.csv'
base_dir = '/home/linux/ProblemXSolutions.com/DataProjects/covid19'
filename = '/data/all_states_daily_covid_py.csv'
urllib.request.urlretrieve(url=base_url + all_states,
filename=base_dir + filename)
# Alternative method to read the data straight to a variable
alt_method_data = pd.read_csv(base_url + all_states)
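For completeness, here is a minimal sketch of the urlopen approach mentioned above; the documentation only names the functions, so the exact combination below is my own and just one way to wire them together:
import pandas as pd
from io import BytesIO
from urllib.request import urlopen

# download the raw bytes, then parse them with pandas and save locally
with urlopen(base_url + all_states) as response:
    raw_bytes = response.read()

alt_data = pd.read_csv(BytesIO(raw_bytes))
alt_data.to_csv(base_dir + filename, index=False)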
Python documentation on urllib.request (python3) and pd.read_csv (pandas)
Reading/Writing CSV Files
The ability to read and write data is a common and important task throughout the analytics/data science process. In the subsections below I will show how to perform these operations on tabular data.
R
To read and write data to a CSV file, we can use the readr package from the tidyverse package ecosystem. The functions are easy to use and can be applied lazily (i.e., with minimal arguments specified). The following snippets show both operations as used in my script.
In the first line, I read in data with the read_csv function; I only need to reference the file location. I could specify a lot more, such as delimiters, column headings, datatypes, lines to skip, missing-value markers, and reading in only a subset of the data.
With the second function, write_csv, I save a data object to the designated path.
state_pop_data <- read_csv(file = './data/state_populations.csv')
write_csv(file = './data/state_data_enhanced.csv',
x = state_data_enhanced)
R documentation on read_csv and write_csv can be found in the readr package. Good cheatsheet.
Python
The following are the 1:1 equivalents of the R versions and function in about the same way. Both read_csv and .to_csv live in the pandas package and operate on tabular data. Each function also has roughly the same optional arguments to tailor how data is read in and written out.
state_data = pd.read_csv(base_dir + filename, low_memory=False)
state_data_enhanced.to_csv(base_dir + '/data/state_data_enhanced_py.csv')
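To illustrate those optional arguments, here is a sketch of a fuller read_csv call; the parameter names are standard pandas arguments, but the values are made up for illustration:
state_data = pd.read_csv(
    base_dir + filename,
    sep=',',               # delimiter
    dtype={'state': str},  # force a column's datatype
    skiprows=0,            # number of lines to skip at the start of the file
    na_values=[''],        # strings to treat as missing values
    nrows=1000)            # read in only a subset of the rows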
Python documentation on read_csv and to_csv can be found in the pandas package.
Exploring the Data
Although I don’t show the data exploration process in the scripts, I’ll briefly describe some of the basic commands you can run.
R
In the first line I get a view of the first few rows of the data.frame/tibble (six by default), though I can specify a number to return that many rows. head() and tail() operate the same way, one showing the top and the other the bottom of the data.frame/tibble. The next line provides a summary of the structure of the object using the str() function. The final line generates summary statistics for each column, or basic information for non-numeric columns.
data %>% head() # or tail()
data %>% str()
data %>% summary()
R documentation on head()/tail(), str() and summary() are part of the base and utility functions.
Python
In the first line I get a view of the first few rows of the DataFrame (five by default), though I can specify a number to return that many rows. head() and tail() operate the same way, one showing the top and the other the bottom of the DataFrame. The next line summarizes the structure and datatypes of the object using the info() method or the dtypes attribute. The final line generates summary statistics for each column, or basic information for non-numeric columns.
data.head() # .tail()
data.info() #or data.dtypes
data.describe()
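For example, both of these are standard pandas options for pulling more rows or including non-numeric columns in the summary:
data.head(10)                 # first 10 rows instead of the default 5
data.describe(include='all')  # summarize numeric and non-numeric columns together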
Python documentation on head()/tail(), info(), dtypes and describe() can be found in the pandas package.
Piping Operations
As mentioned above, the use of pipes to direct inputs and outputs is a style preference. In R, I use pipes and place each data operation on its own line; it keeps the code clean and lets me chain operations into sets. I have worked in both styles, with the same end results.
R
As mentioned in the punctuation section, in R we can use %>% to denote a forward pipe. The value or object on the left gets piped into the function on the right of the punctuation. If we want to assign the final output back to the original input, we can use %<>% to accomplish that. A lot of the time I will test code with the standard pipe, and when I’m satisfied with the end result, I change the first pipe to %<>%.
In the code snippet below, you can see each use of the forward pipe, %>%. It’s simple and clean.
state_pop_reduced <-
state_pop_data %>%
select(State, Pop, density) %>%
left_join(x = .,
y = state_name_lookup_data,
by = c('State' = 'state')) %>%
select(-1) %>%
rename(state = state_abr)
R documentation on magrittr (2.0.1.9000).
Python
The following code in Python accomplishes the same set of operations as above. Since Python doesn’t have piping punctuation, it looks like the rest of the code in the script.
state_pop_reduced = state_pop_data.filter(items=['State', 'Pop', 'density'])
# had to insert this since python merge or read_csv doesn't work the same as in R versions
state_pop_reduced.loc[:, 'State'] = state_pop_reduced.loc[:, 'State'].str.strip()
state_name_lookup_data.loc[:, 'state'] = state_name_lookup_data.loc[:, 'state'].str.strip()
state_pop_reduced = pd.merge(
left=state_pop_reduced,
right=pd.DataFrame(state_name_lookup_data),
how='left',
left_on='State',
right_on='state')
Select Statements
Being able to select or drop columns from a table structure is quite handy for subsetting data. The following examples will demonstrate each operation.
R
The following code snippet shows selecting multiple columns by name to subset my piped-in data. In the second part, I demonstrate dropping a column by column index rather than by name. In this instance the resulting data.frame would only have two columns, “Pop” and “density”.
new_df <-
state_pop_data %>%
select(State, Pop, density)
# Drop a column using column index
new_df %>%
select(-1)
R documentation on the select function from the dplyr package.
Python
The following Python code conducts the same select operation as above. The desired columns are referenced by name in a list; positional selection is also possible, as shown after the drop example below.
state_data_enhanced[['State', 'Pop', 'density']]
# or
state_pop_data.filter(items=['State', 'Pop', 'density'])
# Drop column(s)
state_pop_reduced.drop(columns=['State', 'state'],
inplace=True)
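The R snippet above also dropped a column by index; the closest pandas analogues are positional selection with iloc and dropping a column via its position in columns. A minimal sketch, assuming the same DataFrames as above:
# select columns by position (all rows, columns 0 through 2)
state_pop_data.iloc[:, [0, 1, 2]]

# drop the first column by position, like R's select(-1)
state_pop_data.drop(columns=state_pop_data.columns[0])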
Python documentation on selection, filter and drop in the pandas package
DataFrame Joins
Joining tables and DataFrames is a powerful operation. In order to join or merge these data objects, we need a common key in each one beforehand. In the examples below, I use a left join to merge the data from my state_name_lookup_data table (right, or y) into my state_pop_data table (left, or x). This operation keeps all the records from the left table and joins on only the matching values from the right table. I will cover the different join operations in a separate post.
R
After piping state_pop_data through the select statement, it flows into the left_join function. I join the two tables on their common fields, which are “State” on the left and “state” on the right. Had the column names been exactly the same, I would only need to reference one. I could also rename either column to conform to the other. Unlike Python, the join column from the right table is not kept after the join. Following the join operation, I drop the first column since it is not needed going forward, and I rename one of the new columns to take on the name of the field I just dropped.
state_pop_reduced <-
state_pop_data %>%
select(State, Pop, density) %>%
left_join(x = .,
y = state_name_lookup_data,
by = c('State' = 'state')) %>%
select(-1) %>%
rename(state = state_abr)
R documentation on x_join (dplyr).
Python
The same comments apply as in the R section above; the argument naming differs, but the concepts are the same. In Python, pandas uses the pd.merge function as the join command for DataFrames. The “how” argument specifies the type of join operation to conduct. To demonstrate the same tail-end operations as above, you can see the dropping of the unneeded columns and the renaming of a column.
state_pop_reduced = pd.merge(
left=state_pop_reduced,
right=pd.DataFrame(state_name_lookup_data),
how='left',
left_on='State',
right_on='state')
state_pop_reduced.drop(columns=['State', 'state'],
inplace=True)
state_pop_reduced.rename(columns={'state_abr':'state'},
inplace=True)
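As in R, if the key column had the same name in both DataFrames you would only need to reference it once; in pandas that is the on argument. A minimal sketch, assuming both frames shared a state column:
# when the key column has the same name in both DataFrames
state_pop_reduced = pd.merge(
    left=state_pop_reduced,
    right=state_name_lookup_data,
    how='left',
    on='state')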
Python documentation on merge (pandas).
Renaming Columns
Renaming an existing column is easy in both languages, and the function name is the same in both. There are a number of different ways to accomplish the task; I am showing one way of doing it, the way I used in my script.
R
In R we can rename any column in the pattern of “new_column_name” = “old_column_name”.
state_pop_reduced <-
state_pop_data %>%
select(State, Pop, density) %>%
left_join(x = .,
y = state_name_lookup_data,
by = c('State' = 'state')) %>%
select(-1) %>%
rename(state = state_abr)
R documentation on rename (dplyr).
Python
In Python we can rename any column with the pattern “old_column_name”: “new_column_name”. The inplace argument signifies that the change is made to the DataFrame in its original reference; if the argument is passed as False, the change is not permanent.
state_pop_reduced.rename(columns={'state_abr':'state'},
inplace=True)
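Equivalently, you can skip inplace and reassign the returned copy yourself, which avoids forgetting that the default is False:
# rename on a copy and reassign; the original is untouched until the assignment
state_pop_reduced = state_pop_reduced.rename(columns={'state_abr': 'state'})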
Python documentation on rename (pandas).
Sorting/Ordering DataFrames
The next operation is sorting your DataFrame. In SQL you would use the ORDER BY command to sort data in ascending or descending order by column. You can perform the same task in R and Python, as shown below.
R
In R, we can sort our data.frame using the arrange function. The sort is controlled by the columns in the order they appear in the function arguments. If you want a column to sort in descending order, you wrap the column name in desc(); otherwise it sorts in ascending order, just like SQL.
state_data %>%
arrange(state, date)
R documentation on arrange (dplyr).
Python
In Python, we use the sort_values function. We specify which columns control the sorting and in what order. To control the sort direction of each column, you use the ascending argument and pass a list of boolean values (True = ascending, False = descending); the length of that list should match the number of columns in the by list.
state_data_enhanced \
.sort_values(by=['state', 'date'])
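For example, to sort states ascending but dates descending (newest first), using the standard ascending argument described above:
state_data_enhanced \
    .sort_values(by=['state', 'date'],
                 ascending=[True, False])  # state A-Z, newest date first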
Python documentation on sort_values (pandas).
Creating New Columns with Column Calculations
Enhancing our data is an important task. If the original data structure does not have a field or has multiple fields that could be reduced to one, we can accomplish this by creating or adding structure from calculations or logic statements. The join operation is a separate utility to enhance data as well, but the following operations allow you to create what you need from what you already have.
R
In R we can use the mutate command to add structure through calculations or logic statements against the existing structure. My example script uses it in multiple instances. In the first instance of mutate, I calculate the difference between row values. In the second instance, I create additional fields based on calculations, producing population-adjusted variations of the input fields as well as 7-day moving averages. The third instance uses newly created fields from the second instance to create additional adjusted fields.
The reason for separating the instances is twofold. First, it organizes the code. Second, if I wish to perform an operation on a field I just created, I need to complete the previous operation before having access to it; otherwise I would need to perform the same field calculation twice.
... %>%
mutate(daily_recover = recovered - lag(recovered, default = first(recovered))) %>%
mutate(daily_cases_adj = (positiveIncrease / Pop) * 100000,
daily_recover_adj = (daily_recover / Pop) * 100000,
daily_deaths_adj = (deathIncrease / Pop) * 100000,
active_roll7 = zoo::rollmean(positiveIncrease, k = 7, fill = NA),
recovered_roll7 = zoo::rollmean(daily_recover_adj, k = 7, fill = NA),
deaths_roll7 = zoo::rollmean(deathIncrease, k = 7, fill = NA)) %>%
mutate(active_roll7_adj = zoo::rollmean(daily_cases_adj, k = 7, fill = NA),
recovered_roll7_adj = zoo::rollmean(daily_recover_adj, k = 7, fill = NA),
deaths_roll7_adj = zoo::rollmean(daily_deaths_adj, k = 7, fill = NA))
R documentation on mutate (dplyr).
Python
In the Python version, I had to split the operations into two sets. Using the assign function, I can perform the same tasks, especially in conjunction with lambda functions. As you can see, I structured my assign statements the same way I did in R, except that the second instance had to be broken apart, creating the second set of operations. Python did not seem to like performing calculations on fields newly created in a previous assign statement within the same block. I’m sure there are pros and cons to either method (R vs Python).
state_data_enhanced = \
state_data_enhanced \
.sort_values(by=['state', 'date']) \
.assign(
daily_recover = state_data_enhanced.groupby('state')['recovered'].transform('diff')
) \
.assign(
daily_cases_adj = lambda x: (x['positiveIncrease'] / x['Pop']) * 100000,
daily_recover_adj = lambda x: (x['daily_recover'] / x['Pop']) * 100000,
daily_deaths_adj = lambda x: (x['deathIncrease'] / x['Pop']) * 100000
)
state_data_enhanced = \
state_data_enhanced \
.assign(active_roll7 = state_data_enhanced.groupby('state')['positiveIncrease']
.transform(lambda x: x.rolling(window=7, min_periods=1).mean()),
recovered_roll7 = state_data_enhanced.groupby('state')['daily_recover_adj']
.transform(lambda x: x.rolling(window=7, min_periods=1).mean()),
deaths_roll7 = state_data_enhanced.groupby('state')['deathIncrease']
.transform(lambda x: x.rolling(window=7, min_periods=1).mean())
) \
.assign(active_roll7_adj = state_data_enhanced.groupby('state')['daily_cases_adj']
.transform(lambda x: x.rolling(window=7, min_periods=1).mean()),
recovered_roll7_adj = state_data_enhanced.groupby('state')['daily_recover_adj']
.transform(lambda x: x.rolling(window=7, min_periods=1).mean()),
deaths_roll7_adj = state_data_enhanced.groupby('state')['daily_deaths_adj']
.transform(lambda x: x.rolling(window=7, min_periods=1).mean())
)
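As an aside, newer pandas versions (0.23 and later) evaluate the keyword arguments to assign in order, so a later entry can reference a column created earlier in the same call as long as it is written as a callable. A minimal sketch reusing the column names above, which may remove the need to split the blocks:
state_data_enhanced = state_data_enhanced.assign(
    # computed first
    daily_recover = lambda x: x.groupby('state')['recovered'].transform('diff'),
    # evaluated second, so it can already see daily_recover
    daily_recover_adj = lambda x: (x['daily_recover'] / x['Pop']) * 100000)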
Python documentation on assign and transform (pandas), and on lambda.
What’s Next?
In the next follow-up post, I will begin to demonstrate creating basic data visualizations from the data produced by the initial processing scripts in both R and Python.
GitHub Link
For the full scripts for each language referenced here, refer to my GitHub repo and links below.
- https://github.com/problemxsolutions/covid19/blob/master/scripts/r/covid_data_processing.R
- https://github.com/problemxsolutions/covid19/blob/master/scripts/python/covid_data_processing.py
- https://github.com/problemxsolutions/covid19/blob/master/scripts/julia/covid_data_processing.jl