Now that we have collected our data, lets work on building a database to store our project data. For this project I have decided to use a PostgreSQL database. The image below provides some planning details to help us implement the database.
Installing and Setting up
On a Linux machine, getting PostgreSQL installed and running is pretty easy and straight forward. Just follow the simple instructions from the PostgreSQL website (here). I have copy/pasted their code instructions below for simplicity. This does work on my Linux/Ubuntu 20.04 Computer.
# Create the file repository configuration:
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
# Import the repository signing key:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
# Update the package lists:
sudo apt-get update
# Install the latest version of PostgreSQL.
# If you want a specific version, use 'postgresql-12' or similar instead of 'postgresql':
sudo apt-get -y install postgresql
Side note with install for an R Package
For one of the R packages below I did have to install libpq-dev
as well. This came up when I was trying to install the package. Once that libpq-dev
was installed I had no other issues on the R side of things.
Starting the service and Accessing
The first command we will give our terminal is to start the PostgreSQL service. We are running version 13 as well.
$ sudo systemctl start postgresql@13-main
Next we will switch to our postgres
account (default) on our just started server.
$ sudo -i -u postgres
The following images should be what you see after each terminal command. Notice in the second image, that once you type in psql
, we are now in the psql terminal/shell. From here we can issue all of our commands to interact, manage and query the database.
Creating the Database
In the images below you can see the results of running the following in Psql:
postgres=# create database project_crime;
postgres=# \l
The first command gets a response indicating the database was created. The second command lists the databases in our Postgres instance.
Creating a User
Now we are going to create a general analyst
user. We give this user a password during the command. The shell will respond with CREATE ROLE
as confirmation. Next we can grant privileges to this user for our project_crime
database. The shell’s response is GRANT
.
Now you are ready to import data and create table in your database. You can easily create tables using the Psql shell, but in this post, I will be demonstrating the rest in R. Please refer to the R Package section below for more details about the packages used. If you don’t need the info, skip to the next major section on “Creating Tables and Importing Data” below.
R Packages
I searched for R packages related to PostgreSQL. The results yielded DBI
, RPostgres
and RPostgreSQL
. Each will be briefly described below. For more information, refer to each packages documentation resources.
DBI
The DBI
package provides a database management systems (DBMS) backend interface for R. The package supports the following common/popular databases:
- PostgreSQL
- MariaDB
- SQLite
- ODBC
- BigQuery
More databases are also accessible, but check the package document for more information. Typically each specific database has accompanying R packages specifically to provide support to each database as well. The resources below also provides some code for setting up and interacting with a database via R:
# Install the latest DBI release from CRAN:
install.packages("DBI")
# Or the the development version from GitHub:
# install.packages("devtools")
devtools::install_github("r-dbi/DBI")
RPostgres
RPostgres is a DBI-compliant interface to the PostgreSQL database. It’s a ground-up rewrite using C++ and Rcpp.
# Install the latest RPostgres release from CRAN:
install.packages("RPostgres")
# Or the the development version from GitHub:
# install.packages("remotes")
remotes::install_github("r-dbi/RPostgres")
Resources
- https://rpostgres.r-dbi.org/
- https://db.rstudio.com/databases/postgresql/
- This one is really great and points you to a lot more information for R Packages and specific databases. It also has some information on best practices.
RPostgreSQL
There is not much difference between RPostgreSQL
and the RPostgres
packages. They each connect to the PostgreSQL database via the DBI
. The packages were made by separate groups. The following are resources for the RPostgreSQL
package:
- https://github.com/tomoakin/RPostgreSQL
- https://cran.r-project.org/web/packages/RPostgreSQL/RPostgreSQL.pdf
Other PostgreSQL and R Resources
- https://jdbc.postgresql.org/documentation/head/connect.html
- https://www.datacareer.de/blog/connect-to-postgresql-with-r-a-step-by-step-example/
Creating Tables and Importing Data
The following R code demonstrates connecting to the database I just created then creating a table from my imported weather data
library(tidyverse)
library(magrittr)
library(DBI)
# Connect to desired PostgreSQL database
db <- 'project_crime' # databast name to connect to
host_db <- 'localhost' # Where is db being hosted? default server/host is localhost
db_port <- '5432' # Which port is the server listening to? default port number for PostgreSQL is 5432
db_user <- 'analyst'
db_password <- 'mypassword'
con <- dbConnect(RPostgres::Postgres(),
dbname = db,
host=host_db,
port=db_port,
user=db_user,
password=db_password)
# Lets write some data to a new table in our db
wx_data_to_db <- read_csv(file = 'wx_data_2009_2019.csv')
dbWriteTable(conn = con,
name = "weather",
value = wx_data %>% as.data.frame(.))
# Look at the table in the database
dbReadTable(con, 'weather') %>% head()
After loading some of the other data I have downloaded and processed we can see that our tables are all in the database. In the R command line using the DBI::dbListTables()
function, referencing our connection configuration. In Psql (terminal-based front end), we can do the same with \dt
.
Remove Tables
If for some reason you decide to remove a table from your database, you can simply run a command to take care of that from R.
dbRemoveTable(conn = con, name = "NAME_OF_TABLE")
When I was creating all my tables, I noticed something was wrong with some data. I removed the table, re-processed the data to fix an error and then create the same table again. Easy fix that didn’t take much time to do.
What’s Next?
Now that we have the basics of storing our downloaded and processed data, we can continue to load new data into our database, begin to explore our data and eventually run our analyses to help us answer our questions. We can enhance our data further in the database or we can enhance the data in R then create a new table in our database. With all our data in one spot, we can also write queries to come up with aggregations, summary statistics, join tables, etc. In a later post, I’ll detail both methods (R and Psql) of data enhancement and transformation.
You can certain dig into the database management side of things to practice certain skills. As a Data Scientist, you may not ever need to dabble in this particular area of skills. Typically a Data Engineer or DBA will handle the database stuff.
I have had the opportunity on several projects to be both data engineer for my teams and a data scientist. The skills in both are quite useful and complimentary from my experience doing projects for clients. When conducting projects locally, it can make a lot of sense, however if you are working in an enterprise configuration, there may be strict protocols on who does what and ensuring that standards, governance and security are implemented accordingly.
SQL Resources
If you have not learned much SQL, I would highly recommend setting up an account at HackerRank.com. It is free and provides good problems to solve, which gives you experience. Work through their SQL problems. SQL is a very valuable skill to have and often expected for Data Scientists to know. If you are interested in Database problems, they have those too. The image below shows you all the skills available to learn from their platform.
For referencing or looking up commands, syntax, and examples use W3Schools.com and tutorialspoint.com.
Database Resources
For more information on PostgreSQL check out their website and documentation at https://www.postgresql.org/docs/13/index.html.
For more information on Psql check out this guide: http://www.postgresguide.com/utilities/psql.html.
Another resource for installing PostgreSQL on Linux/Ubuntu 20.04 with more information about everything covered above from the terminal: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-20-04
Planning Progress
I will reference back to my plan and update what we have so far. Using XMind’s icons, I put task completion status next to each data item to indicate level of progress. I have expanded a couple areas to add some granularity. If during the process I expand or modify my plan, I should make sure my plan reflects those changes. Project documentation is a good skill to have. If you document properly as you go, it will save you time and patience later.
Posts in Project Series
- Criminal Analysis: Planning
- Criminal Analysis: Data Search (part 0)
- Criminal Analysis: Data Search (part 1)
- Criminal Analysis: Data Search (part 2)
- Criminal Analysis: Data Search (part 3)
- Criminal Analysis: Data Storage
- Criminal Analysis: Data Storage (part 2)
- Criminal Analysis: Data Search (part 4)
- Derive a Star Schema By Example
- Criminal Analysis: Data Exploration
- Criminal Analysis: Data Exploration (part 1)
- Criminal Analysis: Data Exploration (part 2a)
- Criminal Analysis: Data Exploration (part 2b)
- Criminal Analysis: Data Storage (Part 3)