As part of converting my Criminal Analysis Data Project code from R to Julia, I am writing a series of small posts, each detailing one piece of the translation process for data operations. This particular post shows how to take tabular data from a CSV file and load it into a PostgreSQL database.
I will show both methods (R and Julia) for accomplishing the same task, so this post is useful whether you are translating between the languages or just need to know the process in one of them.
For additional information on script conversions for R to Julia, please reference the list of posts at the bottom of this page.
Language and Package Versions
I am currently using the following versions of R, Julia and their packages.
R
# R v4.0.3
library(readr) # v1.4.0
library(DBI) # v1.1.1
library(RPostgres) # v1.3.1
Julia
# Julia v1.5.3
using LibPQ # v1.6.2
using CSVFiles # v0.16.2
using DataFrames # v0.21.8
Connecting to PostgreSQL
To conduct operations in PostgreSQL from either language, we need to establish a connection. The following details the user-defined function implemented in my data project.
R
I created the following function to simplify my data project efforts: I can source the function script and access what I need rather than writing out the connection details in every project-related script. Here, I’m just defining the connection parameters to access the database with a username/password.
pg_connect <- function(db = 'project_crime',
                       host_db = 'localhost',
                       db_port = '5432',
                       db_user = 'analyst',
                       db_password = 'mypassword'){
  return(
    DBI::dbConnect(RPostgres::Postgres(),
                   dbname = db,
                   host = host_db,
                   port = db_port,
                   user = db_user,
                   password = db_password)
  )
}
I have this function and the appropriate documentation in a separate script (available on GitHub), and I can call it from within my other scripts like this:
source('~/ProblemXSolutions.com/DataProjects/DC_Crime/project_crime/scripts_r/project_functions_db.R')
Julia
In Julia, I used the same structure as my R version to keep the two implementations parallel during the translation process. I’m focused on functionality first; optimization can come later. There may be a better method of implementation, but this works just fine.
function pg_connect(;db = "project_crime",
                    host_db = "localhost",
                    db_port = "5432",
                    db_user = "analyst",
                    db_password = "mypassword")

    conn_string = string(
        "dbname=$db",
        " host=$host_db",
        " port=$db_port",
        " user=$db_user",
        " password=$db_password",
        " connect_timeout=10"
    )

    return LibPQ.Connection(conn_string)
end
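As a quick sanity check (a sketch, assuming the function above is in scope and the database is running), you can open a connection, run a trivial query, and close it:

# Open a connection, confirm it responds, then close it
conn = pg_connect()
print(DataFrame(LibPQ.execute(conn, "SELECT version();")))
close(conn)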
I have this function and appropriate documentation in a separate script. I can call it from within my other scripts like this.
include("$base_dir/project_crime/scripts_julia/project_functions_db.jl")
This associated script has not been pushed to GitHub at the time of this writing. Check the GitHub project to see if the folder “scripts_julia” is present. If it is, look for the script as referenced in the block above.
Bulk Loading CSV to PostgreSQL
Creating a table and loading data into that table is an important process. Below are the methods for each language to read the CSV file, create the table, and load the data. To conclude, I will also query the data to validate the results.
R
The R version of this process is pretty simple and straightforward. First we source the pg_connect() function. Next we read in the desired CSV file. In R, the function readr::read_csv loads the CSV file into a tibble (data frame).
source('~/ProblemXSolutions.com/DataProjects/DC_Crime/project_crime/scripts_r/project_functions_db.R')

# Read the CSV into a tibble
wx_data_to_db <- read_csv(file = '../data/weather/wx_data_2009_2019.csv')

con <- pg_connect()

# dbWriteTable creates the table and loads the data in one step
dbWriteTable(conn = con,
             name = "weather",
             value = as.data.frame(wx_data_to_db))

# Validate the result
head(dbReadTable(con, 'weather'))
In R, I don’t have to define the table at creation; the function dbWriteTable handles that operation for us.
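If you want to confirm the column types that dbWriteTable inferred, one option (a sketch, assuming the table landed in the default public schema) is to ask PostgreSQL’s information_schema:

select column_name, data_type
from information_schema.columns
where table_name = 'weather';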
Julia
The equivalent process in Julia is a little more involved, since at this point in its development the language does not have packages with the same one-step functions that R does. We can still achieve the same results, though.
Once we load the CSV file, we pipe the data to the DataFrame function from the DataFrames package. I have parameterized the process more, so it may look more involved in comparison to the R version above.
In Julia, I did have to define and create the table before bulk inserting the data, as you will see in the final portion in the block below.
base_dir = "/home/linux/ProblemXSolutions.com/DataProjects/DC_Crime"
include("$base_dir/project_crime/scripts_julia/project_functions_db.jl")

base_dir_wx = "$base_dir/data/weather"
filename = "$base_dir_wx/full_wx_data_2009_2019.csv"

# Load the CSV and pipe it into a DataFrame
wx_data_to_db = load(filename) |> DataFrame

table_name = "weather_julia_load"

# Build the CREATE TABLE statement
exe_s = string(
    "CREATE TABLE ", table_name, " (",
    "date timestamp,",
    "datatype varchar(4),",
    "station varchar(17),",
    "value int,",
    "fl_m varchar(2),",
    "fl_q varchar(10),",
    "fl_so varchar(1),",
    "fl_t int,",
    "units varchar(25)",
    ");")
The next two code blocks perform the tasks accomplished by the DBI::dbWriteTable() function above in R. In this next block, I execute the SQL statement to create the table.
# Open a single connection to reuse for the table creation and the load
conn = pg_connect()
result = LibPQ.execute(conn, exe_s)
Once the table has been created, we can bulk insert the DataFrame into the table. Per the LibPQ documentation, you get better performance by structuring the process the way I did below. Again, the first half of the block just creates the string parameters, while the latter half actually performs the task of loading the data.
# Define the column names string for the SQL statement
colnames_s = join(map(string, names(wx_data_to_db)), ',')

# Define the column indices for the VALUES portion of the SQL statement
value_s = string("\$", join(map(string, [1:1:ncol(wx_data_to_db);]), ",\$"))

# Bulk insertion inside a single transaction greatly increases performance
LibPQ.execute(conn, "BEGIN;")

LibPQ.load!(wx_data_to_db,
            conn,
            string("INSERT INTO ", table_name, " (", colnames_s, ") ",
                   "VALUES (", value_s, ");"))

LibPQ.execute(conn, "COMMIT;")
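To see what those two helper strings produce, you can print the assembled statement before running the load (a sketch, assuming the DataFrame columns match the nine-column table definition above):

insert_s = string("INSERT INTO ", table_name, " (", colnames_s, ") ",
                  "VALUES (", value_s, ");")
# With the nine weather columns, this prints something like:
# INSERT INTO weather_julia_load (date,datatype,station,value,fl_m,fl_q,fl_so,fl_t,units) VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9);
println(insert_s)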
We can query to see that the table has been populated.
query_result = LibPQ.execute(conn,
                             string("SELECT * FROM ", table_name, " LIMIT 10;"))
query_results = DataFrame(query_result)
print(query_results)
Checking the tables in the database
Checking the two tables in the PostgreSQL database is easy to do. Notice that I named the tables differently to distinguish which came from where. Since R is my default, I don’t designate the language in its table name. For my project, I could easily append “_r”, “_p”, or “_j” to distinguish between the scripting languages and validate that they each produce the same results.
select * from weather limit 10;
Now, we query the table created in Julia
select * from weather_julia_load limit 10;
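You can also list every table in the schema to confirm both are present (a sketch using information_schema; \dt in psql works as well):

select table_name
from information_schema.tables
where table_schema = 'public';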
Resultant Julia Function
In light of the Julia process above, I created a function to act as a quasi-dbWriteTable stand-in. It certainly has room for improvement; take it and build upon this base to suit your needs.
Julia Function dbWriteTable_df
function dbWriteTable_df(;conn = pg_connect(),
                         df,
                         table_name,
                         return_head = true)

    # Build the column name and value placeholder strings
    colnames_s = join(map(string, names(df)), ',')
    value_s = string("\$", join(map(string, [1:1:ncol(df);]), ",\$"))

    # Bulk insertion inside a single transaction greatly increases performance
    LibPQ.execute(conn, "BEGIN;")

    LibPQ.load!(df,
                conn,
                string("INSERT INTO ", table_name, " (", colnames_s, ") ",
                       "VALUES (", value_s, ");"))

    LibPQ.execute(conn, "COMMIT;")

    # Optionally return a preview of the loaded table
    if return_head
        query_result = LibPQ.execute(conn,
                                     string("SELECT * FROM ", table_name, " LIMIT 5;"))
        query_results = DataFrame(query_result)
        print(query_results)
    end

    println("Table Loading Complete")
end
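Usage then looks like this (a sketch: the target table must already exist, since the function only handles the insert step, not the CREATE TABLE):

# Load the weather DataFrame into the table created earlier
dbWriteTable_df(df = wx_data_to_db,
                table_name = "weather_julia_load")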