***This was created and valid in 2016. Links for this specific data project may no longer be current***
Finding the Data
The first step in the process is planning out what you want to do. Now that you have planned out your project, it's time to get the data. The data search works hand in hand with the planning process, because you should make plans around data that actually exists. This part focuses on finding the data.
As mentioned previously, the project was focused on criminal activity in Washington D.C. The date range of interest was 2010 to 2015. This date range facilitates creating a baseline of activity across several temporal grouping parameters. With the ability to understand the activity during that six-year span, comparisons could then be made against the present year as well as any future data feeds.
I ran some online searches using the search terms {“criminal activity”, “Washington D.C.”, “data”}. It took some looking to find exactly what I wanted, but I eventually came across the following site: http://data.octo.dc.gov/Metadata.aspx?id=3
The page lists details about the data set, including Points of Contact, Data Accuracy, and Attribute descriptions.
At the following site, http://data.octo.dc.gov/NewCalendar.aspx?datasetid=3, I was able to download the data for 2011, 2012, and 2013 readily. For 2010, 2014, and 2015, I had to conduct a custom download, which was also on the same page.
A direct link to the data via URL or API is ideal, but it is not available everywhere. Sometimes you need to fill out parameters in a user form to download the data. In either case, you can download the data with the appropriate functions, or download it manually and put the file in the directory you want to work from.
For the 2011 data, I was able to copy the link address from the page.
How do you download the data?
If you have the URL to the data source you want, that is half the battle. The other half is knowing what to do with it. Each language has its own syntax for the same operation. Depending on what you are attempting to download, you may need additional packages/libraries (e.g., httr, RCurl). The following is what you can do in R:
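# Signature: download.file(url, destfile, method, quiet, mode, cacheOK, extra)
# URL of the 2011 archive, copied from the download page
DC_crime_url <- "http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2011_CSV.zip"
# Create a temporary file to store the download
temp <- tempfile()
# Download the data into the temp file
# (mode = "wb" keeps the zip archive intact on Windows)
download.file(url = DC_crime_url, destfile = temp, mode = "wb")
# Unzip the temp file into the specified directory
unzip(zipfile = temp, exdir = "./data/")
# Remove the temp file
unlink(temp)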
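After the unzip step, it can help to confirm what actually landed in the data folder. The following is a minimal sketch, not part of the original workflow; the helper name `list_data_files` is hypothetical, and it assumes the files were extracted to `./data`.

```r
# Hypothetical helper: summarise the files found in a data directory.
# Assumes the archive was extracted to `dir` (default "./data").
list_data_files <- function(dir = "./data") {
  files <- list.files(dir, full.names = TRUE)
  data.frame(
    file = basename(files),
    size_kb = round(file.size(files) / 1024, 1),
    stringsAsFactors = FALSE
  )
}

# Example call (uncomment once ./data exists):
# list_data_files("./data")
```

An empty result means nothing was extracted, which usually points to a failed download or a wrong `exdir`.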
Now that the file is downloaded and in my data folder, I can insert another URL for the next data set until I have all the data I need for the project. After investigating the URL from the example download above, I found that the files for all the years were available under the same base URL.
The following demonstrates how you could write the script to download each of the files systematically.
DC_crime_url_base <- "http://data.octo.dc.gov/feeds/crime_incidents/archive/"
DC_crime_url_filebase <- "crime_incidents_"
DC_crime_url_fileend <- "_CSV.zip"
DC_crime_url_vect <- 2010:2015
# Build one URL per year
DC_crime_url <- paste0(DC_crime_url_base, DC_crime_url_filebase, DC_crime_url_vect, DC_crime_url_fileend)
# Because I know the process of downloading one file, I'll create a
# function to do all the steps.
read_dc_data <- function(data){
  temp <- tempfile()
  # mode = "wb" keeps the zip archive intact on Windows
  download.file(url = data, destfile = temp, mode = "wb")
  unzip(zipfile = temp, exdir = "./data/")
  unlink(temp)
}
# This applies the custom function to the vector of URLs
sapply(X = DC_crime_url, FUN = read_dc_data)
In less than 30 seconds, we were able to download all the data from 2010 to 2015. Within the archive folder we can also see that we could have downloaded data back to 2006.
When in doubt, search for how to use a function. The main idea is knowing how to download data, or being able to search for help with a specific function. An example internet query, {“R”, “download data”, “zip file”}, might turn up a solution or an example of using the function. Add more query terms if you want something more specific. Each function you find may have arguments that are not needed for your data. What you pass to the function will depend on your needs and the specifics of the target data set.
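If you wanted to reach back to 2006, one way to sketch it is below. This is an illustration under assumptions: it assumes the 2006 to 2009 archives follow the same file naming pattern as 2010 to 2015, which the post does not confirm. The helper names `build_dc_crime_urls` and `download_each` are hypothetical. Because some of those URLs might not exist, the wrapper records which downloads failed instead of stopping at the first error.

```r
# Assumption: older archives use the same naming pattern as 2010-2015.
build_dc_crime_urls <- function(years) {
  paste0("http://data.octo.dc.gov/feeds/crime_incidents/archive/",
         "crime_incidents_", years, "_CSV.zip")
}

# Hypothetical wrapper: call `fetch` on each URL and return a named
# logical vector (TRUE = succeeded, FALSE = raised an error).
download_each <- function(urls, fetch) {
  vapply(urls, function(u) {
    tryCatch({
      fetch(u)
      TRUE
    }, error = function(e) {
      message("Failed: ", u, " (", conditionMessage(e), ")")
      FALSE
    })
  }, logical(1))
}

all_urls <- build_dc_crime_urls(2006:2015)
# Pass in a single-file download function such as read_dc_data:
# results <- download_each(all_urls, read_dc_data)
# names(results)[!results]   # URLs that could not be downloaded
```

Passing the download function in as an argument keeps the wrapper independent of how a single file is fetched.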
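Besides an internet search, R itself can tell you what a function accepts. The lines below are a small sketch using base R tools on `download.file`; the variable name `arg_names` is only for illustration.

```r
# Print the argument list and default values of download.file
args(utils::download.file)

# Collect just the argument names, e.g. to check whether `mode` is supported
arg_names <- names(formals(utils::download.file))
print(arg_names)

# In an interactive session, these open the documentation:
# ?download.file
# help.search("download")
```

Comparing the argument list with your data source shows quickly which arguments you can leave at their defaults.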
If you want to download the data into a directory as we just did, that works just fine. If you are downloading a series of files or different data sources, you may want to look into storing the data in a database to further develop the process. During the next stage, we will cover data storage in more detail.
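When downloading into a directory, re-running the script fetches every file again. The following is a minimal sketch of one way to avoid that, assuming the archive is kept on disk under the file name from its URL. The helper `get_data_file` and its `fetch` argument are hypothetical and not part of the original script.

```r
# Hypothetical helper: return the local path of a file, downloading it
# only when it is not already present in `dest_dir`.
# `fetch` defaults to download.file and can be swapped for another back end.
get_data_file <- function(url, dest_dir = "./data",
                          fetch = utils::download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(url))
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary files such as zip archives intact on Windows
    fetch(url = url, destfile = dest, mode = "wb")
  }
  dest
}

# Example call (uncomment to run against a real URL):
# zip_path <- get_data_file(DC_crime_url[1])
# unzip(zipfile = zip_path, exdir = "./data/")
```

The trade-off is that a stale or partially downloaded file will be reused, so delete it manually if a download was interrupted.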