As part of converting my Criminal Analysis Data Project code from R to Julia, I am writing a series of small posts, each detailing one piece of the translation process for data operations. This particular post shows how to take tabular data from a CSV file and load it into a PostgreSQL database.
I will show both methods (R and Julia) for accomplishing the same task, so this post is useful whether you are translating between the languages or just need to know the process in one of them.
For additional information on script conversions for R to Julia, please reference the list of posts at the bottom of this page.
Language and Package Versions
I am currently using the following versions of R, Julia and their packages.
R
# R v4.0.3
library(readr) # v1.4.0
library(DBI) # v1.1.1
library(RPostgres) # v1.3.1
Julia
# Julia v1.5.3
using LibPQ # v1.6.2
using CSVFiles # v0.16.2
using DataFrames # v0.21.8
Connecting to PostgreSQL
To conduct operations in PostgreSQL from either language, we need to establish a connection. The following details the user-defined function implemented in my data project.
R
I created the following function to simplify my data project efforts: I can source the function script and access what I need rather than writing out the connection details in every project-related script. Here, I’m just defining the connection parameters to access the database with a username/password.
pg_connect <- function(db = 'project_crime',
                       host_db = 'localhost',
                       db_port = '5432',
                       db_user = 'analyst',
                       db_password = 'mypassword'){
  return(
    DBI::dbConnect(RPostgres::Postgres(),
                   dbname = db,
                   host = host_db,
                   port = db_port,
                   user = db_user,
                   password = db_password)
  )
}
I have this function and the appropriate documentation in a separate script (available on GitHub), and I can call it from within my other scripts like this:
source('~/ProblemXSolutions.com/DataProjects/DC_Crime/project_crime/scripts_r/project_functions_db.R')
Julia
In Julia, I used the same structure as my R version to keep the two implementations parallel during the translation process. I’m focused on functionality first; optimization can come later. There may be a better method of implementation, but this works just fine.
function pg_connect(;db = "project_crime",
                    host_db = "localhost",
                    db_port = "5432",
                    db_user = "analyst",
                    db_password = "mypassword")

    conn_string = string(
        "dbname=$db",
        " host=$host_db",
        " port=$db_port",
        " user=$db_user",
        " password=$db_password",
        " connect_timeout=10"
    )

    return LibPQ.Connection(conn_string)
end
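As a quick sanity check (a sketch, assuming the function above is in scope and the database is running), you can open a connection, run a trivial query, and close it:

# Open a connection, confirm it responds, then close it
conn = pg_connect()
print(DataFrame(LibPQ.execute(conn, "SELECT version();")))
close(conn)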
I have this function and appropriate documentation in a separate script. I can call it from within my other scripts like this.
include("$base_dir/project_crime/scripts_julia/project_functions_db.jl")
This associated script has not been pushed to GitHub at the time of this writing. Check the GitHub project to see if the folder “scripts_julia” is present. If it is, look for the script as referenced in the block above.
Bulk Loading CSV to PostgreSQL
Creating a table and loading data into that table is an important process. Below are the methods for each language to read the CSV file, create the table, and load the data. To conclude, I will also query the data to validate the results.
R
The R version of this process is pretty simple and straightforward. First we source the pg_connect() function. Next we read in the desired CSV file. In R, the function readr::read_csv loads the CSV file into a tibble (data frame).
source('~/ProblemXSolutions.com/DataProjects/DC_Crime/project_crime/scripts_r/project_functions_db.R')

# Read the CSV into a tibble
wx_data_to_db <- read_csv(file = '../data/weather/wx_data_2009_2019.csv')

con <- pg_connect()

# dbWriteTable creates the table and loads the data in one step
dbWriteTable(conn = con,
             name = "weather",
             value = as.data.frame(wx_data_to_db))

# Validate the result
head(dbReadTable(con, 'weather'))
In R, I don’t have to define the table at creation; the function dbWriteTable handles that operation for us.
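If you want to confirm the column types that dbWriteTable inferred, one option (a sketch, assuming the table landed in the default public schema) is to ask PostgreSQL’s information_schema:

select column_name, data_type
from information_schema.columns
where table_name = 'weather';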
Julia
The equivalent process in Julia is a little more involved, since at this point in its development the language does not have packages with the same one-step functions that R does. We can still achieve the same results, though.
Once we load the CSV file, we pipe the data to the DataFrame function from the DataFrames package. I have parameterized the process more, so it may look more involved in comparison to the R version above.
In Julia, I did have to define and create the table before bulk inserting the data, as you will see in the final portion in the block below.
base_dir = "/home/linux/ProblemXSolutions.com/DataProjects/DC_Crime"
include("$base_dir/project_crime/scripts_julia/project_functions_db.jl")

base_dir_wx = "$base_dir/data/weather"
filename = "$base_dir_wx/full_wx_data_2009_2019.csv"

# Load the CSV and pipe it into a DataFrame
wx_data_to_db = load(filename) |> DataFrame

table_name = "weather_julia_load"

# Build the CREATE TABLE statement
exe_s = string(
    "CREATE TABLE ", table_name, " (",
    "date timestamp,",
    "datatype varchar(4),",
    "station varchar(17),",
    "value int,",
    "fl_m varchar(2),",
    "fl_q varchar(10),",
    "fl_so varchar(1),",
    "fl_t int,",
    "units varchar(25)",
    ");")
The next two code blocks perform the tasks accomplished by the DBI::dbWriteTable() function above in R. In this next block, I execute the SQL statement to create the table.
# Open a single connection to reuse for the table creation and the load
conn = pg_connect()
result = LibPQ.execute(conn, exe_s)
Once the table has been created, we can bulk insert the DataFrame into the table. Per the LibPQ documentation, you get better performance by structuring the process the way I did below. Again, the first half of the block just creates the string parameters, while the latter half actually performs the task of loading the data.
# Define the column names string for the SQL statement
colnames_s = join(map(string, names(wx_data_to_db)), ',')

# Define the column indices for the VALUES portion of the SQL statement
value_s = string("\$", join(map(string, [1:1:ncol(wx_data_to_db);]), ",\$"))

# Bulk insertion inside a single transaction greatly increases performance
LibPQ.execute(conn, "BEGIN;")

LibPQ.load!(wx_data_to_db,
            conn,
            string("INSERT INTO ", table_name, " (", colnames_s, ") ",
                   "VALUES (", value_s, ");"))

LibPQ.execute(conn, "COMMIT;")
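To see what those two helper strings produce, you can print the assembled statement before running the load (a sketch, assuming the DataFrame columns match the nine-column table definition above):

insert_s = string("INSERT INTO ", table_name, " (", colnames_s, ") ",
                  "VALUES (", value_s, ");")
# With the nine weather columns, this prints something like:
# INSERT INTO weather_julia_load (date,datatype,station,value,fl_m,fl_q,fl_so,fl_t,units) VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9);
println(insert_s)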
We can query to see that the table has been populated.
query_result = LibPQ.execute(conn,
                             string("SELECT * FROM ", table_name, " LIMIT 10;"))
query_results = DataFrame(query_result)
print(query_results)
Checking the tables in the database
Checking the two tables in the PostgreSQL database is easy to do. Notice that I named the tables differently to distinguish which came from where. Since R is my default, I don’t designate the language in its table name. For my project, I could easily append “_r”, “_p”, or “_j” to distinguish between the scripting languages and validate that they each produce the same results.
select * from weather limit 10;
Now, we query the table created in Julia
select * from weather_julia_load limit 10;
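You can also list every table in the schema to confirm both are present (a sketch using information_schema; \dt in psql works as well):

select table_name
from information_schema.tables
where table_schema = 'public';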
Resultant Julia Function
In light of the Julia process above, I created a function to act as a quasi-dbWriteTable stand-in. It certainly has room for improvement; take it and build upon this base to suit your needs.
Julia Function dbWriteTable_df
function dbWriteTable_df(;conn = pg_connect(),
                         df,
                         table_name,
                         return_head = true)

    # Build the column name and value placeholder strings
    colnames_s = join(map(string, names(df)), ',')
    value_s = string("\$", join(map(string, [1:1:ncol(df);]), ",\$"))

    # Bulk insertion inside a single transaction greatly increases performance
    LibPQ.execute(conn, "BEGIN;")

    LibPQ.load!(df,
                conn,
                string("INSERT INTO ", table_name, " (", colnames_s, ") ",
                       "VALUES (", value_s, ");"))

    LibPQ.execute(conn, "COMMIT;")

    # Optionally return a preview of the loaded table
    if return_head
        query_result = LibPQ.execute(conn,
                                     string("SELECT * FROM ", table_name, " LIMIT 5;"))
        query_results = DataFrame(query_result)
        print(query_results)
    end

    println("Table Loading Complete")
end
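Usage then looks like this (a sketch: the target table must already exist, since the function only handles the insert step, not the CREATE TABLE):

# Load the weather DataFrame into the table created earlier
dbWriteTable_df(df = wx_data_to_db,
                table_name = "weather_julia_load")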