***This was created and valid in 2016. Links for this specific data project may no longer be current***
Where do you store all the data?
It is quite possible that you have to download a series of files to obtain a full data set, or that you have many data sources you would like to easily query for the results you need. You have a few options, but the architecture really depends on your situation, technical skills (or willingness to learn), and time. You may decide to leave files in a specified folder and bring them in as you need them, or you may find ingesting them into a database to be more effective. There are a lot of packages, libraries, and support extensions out there for the various languages and programs. Tableau has a list of databases it can connect to, and establishing the connection is pretty straightforward.
Read in the file(s)
There are many functions for reading in data, and the right one depends on the file type. Typical file types are csv, xlsx, json, and xml.
Some example functions:
- base::read.csv
- openxlsx::read.xlsx
  - Works well with larger xlsx files and does not run into the Java heap size issue that the xlsx package's read.xlsx function has.
  - The openxlsx package does not rely on rJava and its dependencies; it uses Rcpp instead.
  - In my experience, this works best for tabulated data.
- jsonlite::fromJSON, rjson::fromJSON
  - The jsonlite package seems to work better.
- XML::xmlParse, XML::xmlToDataFrame
  - XML files can be trickier; the approach really depends on the structure of the file.
  - There are plenty of resources out there that provide examples of reading XML.
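As a quick illustration of the json route, here is a minimal sketch using jsonlite (assuming the package is installed; the field names are made up for this example). When every record shares the same fields, fromJSON simplifies the result straight into a data.frame:

```r
library(jsonlite)

# A small JSON array of records with identical fields (illustrative only).
json_text <- '[{"offense": "THEFT", "year": 2010},
               {"offense": "BURGLARY", "year": 2011}]'

# fromJSON simplifies an array of same-shaped objects into a data.frame.
records <- fromJSON(json_text)

print(class(records))
print(records$offense)
```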
The following process gets all the file names in the folder we downloaded the files into, then reads the files into R. Before merging the files, we need to make sure that their columns align with each other.
```r
# Get all the csv file names in the folder.
data_files <- dir(path = "./data", pattern = "\\.csv$", full.names = TRUE)

# Read each file into a named list of data.frames.
crime_data <- sapply(data_files,
                     function(x) read.csv(file = x, stringsAsFactors = FALSE),
                     simplify = FALSE)

# Print the dimensions of each file.
t(sapply(crime_data, dim))

# Print each file's column headers.
sapply(crime_data, colnames)

# After inspecting the column headers of each data.frame, the 2010 data
# appeared different from the 2011-2015 data.
# As a result, we combine only the data.frames whose columns match.
data_2010 <- do.call(rbind.data.frame, crime_data[1])
data_2011_2015 <- do.call(rbind.data.frame, crime_data[2:length(crime_data)])
rm(crime_data)
```
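Rather than eyeballing the column headers, you can also verify alignment programmatically before rbind-ing. A minimal sketch, using synthetic data in place of the downloaded files:

```r
# Synthetic stand-ins for the per-year data.frames (illustrative only).
frames <- list(
  data.frame(offense = "THEFT",    year = 2011),
  data.frame(offense = "ROBBERY",  year = 2012),
  data.frame(offense = "BURGLARY", year = 2013)
)

# TRUE only if every frame's column names match the first frame's.
aligned <- all(vapply(frames,
                      function(df) identical(colnames(df), colnames(frames[[1]])),
                      logical(1)))

# Combine only when the columns line up.
if (aligned) {
  combined <- do.call(rbind.data.frame, frames)
}
```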
If you choose to just read in the files as you need them from the folder location, then skip to the next stage, Data Preparation. If you choose to ingest the data into a database, you have several open source options, as detailed below.
There are plenty of open source databases for your needs. Downloading json files? Maybe try storing them in MongoDB. Have tabulated data? Try a PostgreSQL database. The next stage will go more into the storage portion and available resources. I have only listed some of the common databases by SQL and NoSQL categories.
Open Source SQL Databases with accompanying R packages:
- PostgreSQL
  - RPostgreSQL, RPostgres (github)
- MySQL
  - RMySQL
- MariaDB
  - Same as the RMySQL package
Open Source NoSQL Databases:
- MongoDB
- Document-oriented database
- mongolite
- rmongodb (github)
- Apache Cassandra
- Key-value and column-oriented hybrid database
- RCassandra
- Neo4j
- Graph database
- RNeo4j (github)
- CouchDB
- Document-oriented database
- couchDB
- R4CouchDB (github)
The format and structure of the data will determine a lot about which database is the best for your project data. Due to the tabulated structure of the Washington D.C. criminal activity data, I have decided to use a PostgreSQL database.
Ingest the data into a PostgreSQL Database
The following code demonstrates the process of loading the data into a PostgreSQL database. Each database mentioned above has a similar process, with specific functions corresponding to each database.
```r
# If you want to load the data into the database, do the following:
library(devtools)
install_github("RcppCore/Rcpp")
install_github("rstats-db/DBI")
install_github("rstats-db/RPostgres")

library(DBI)
library(RPostgres)

# Prompt for the password so it is not hard-coded in the script.
# Note: readline() echoes the input; consider the getPass package
# if you want the password masked.
user_password <- function() {
  readline(prompt = "Enter your password: ")
}

# The database had to be set up beforehand.
# I named the database 'crimeproject'; the username was the default,
# and the password was set on creation.
conn <- dbConnect(RPostgres::Postgres(),
                  port = '5432',
                  dbname = 'crimeproject',
                  user = 'postgres',
                  password = user_password())

# List the tables in the database.
dbListTables(conn)

crime_data <- list(data_2010, data_2011_2015)
db_table_names <- c("crime_data_2010", "crime_data_2011_2015")

# Write the data into tables in the database.
for (i in seq_along(db_table_names)) {
  dbWriteTable(conn, db_table_names[i], crime_data[[i]])
}

# Close the connection when done.
dbDisconnect(conn)
```
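Because DBI abstracts over the backend, the same dbWriteTable/dbGetQuery workflow works against any supported database. As an illustration that runs without a database server, here is the round trip against an in-memory SQLite database (assumes the RSQLite package is installed; the table and column names are made up for this sketch):

```r
library(DBI)
library(RSQLite)

# An in-memory SQLite database needs no server and vanishes on disconnect.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a small data.frame into a table.
sample_data <- data.frame(offense = c("THEFT", "ROBBERY"),
                          year = c(2011, 2012))
dbWriteTable(conn, "crime_sample", sample_data)

print(dbListTables(conn))

# Query the table back through the same DBI interface.
result <- dbGetQuery(conn, "SELECT offense FROM crime_sample WHERE year = 2011")
print(result$offense)

dbDisconnect(conn)
```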
In the next stage I’ll go through the preparation processes.