Searching for Criminal Activity Data
My initial search begins with querying for “Washington DC” and “Crime Data”. This is very broad but it lets me see what is available. I can always get more specific.
https://duckduckgo.com/?q=Washington+DC+crime+data&t=lm&ia=web
My results yielded a couple of pages of interest. Check out the sites that seem interesting or could provide the desired data.
First site
First, I checked out https://mpdc.dc.gov/page/statistics-and-data. On this page, I was able to learn about what information they had available.
- Crime Statistics and Data
- Traffic Data
- Unsolved Cases and Missing Persons
- Reports
- Additional Crime Data Resources
Explore the information as needed. There was a lot of good information and background about their data. Some of the tables they provide would be good to check our downloaded data against for validation. The comparison could also reveal any statistical differences between our data and what they display.
Next Site
After scrolling down some, I saw this link to a government open data site, https://opendata.dc.gov/. Open data sites can be great starting points, though they are typically not the full solution.
Filtering down to datasets, I get the following:
(URL) https://opendata.dc.gov/search?collection=Dataset&q=Crime
As I scroll down, I can see the incidents by year. They have data in my 2009-2019 range, as well as incident data for 2020. For now, I’ll list the URLs to each year of incidents.
- https://opendata.dc.gov/datasets/crime-incidents-in-2009
- https://opendata.dc.gov/datasets/crime-incidents-in-2010
- https://opendata.dc.gov/datasets/crime-incidents-in-2011
- https://opendata.dc.gov/datasets/crime-incidents-in-2012
- https://opendata.dc.gov/datasets/crime-incidents-in-2013
- https://opendata.dc.gov/datasets/crime-incidents-in-2014
- https://opendata.dc.gov/datasets/crime-incidents-in-2015
- https://opendata.dc.gov/datasets/crime-incidents-in-2016
- https://opendata.dc.gov/datasets/crime-incidents-in-2017
- https://opendata.dc.gov/datasets/crime-incidents-in-2018
- https://opendata.dc.gov/datasets/crime-incidents-in-2019
- https://opendata.dc.gov/datasets/crime-incidents-in-2020
On each page, I can look at the metadata and see what data is available for the particular year.
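The yearly page URLs listed above share one pattern, so they can be generated rather than typed out. This is a minimal sketch, assuming the pattern holds for every year in the list; the names `page_base` and `dataset_pages` are mine, not part of the site.

```python
# Sketch: rebuild the yearly dataset page URLs from the pattern in the list above.
# page_base is the prefix shared by every listed URL (an illustrative name).
page_base = "https://opendata.dc.gov/datasets/crime-incidents-in-"

# 2009 through 2020 inclusive, matching the years listed
dataset_pages = {year: f"{page_base}{year}" for year in range(2009, 2021)}

for year, url in dataset_pages.items():
    print(year, url)
```

Keeping the pages in a dictionary keyed by year makes it easy to revisit the metadata for any single year.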
Reviewing the Data
The next few sections will cover what is on each of the criminal incident pages. They are all structured the same.
Overview Page
The Overview tab on the 2009 data page (link) lets us explore some of the data fields and information about the dataset.
When you select some of the attribute fields, you can see the distribution of values. This is a good initial exploration of the dataset before our data exploration efforts in R or Python.
Metadata Sample
Using the following link you can explore the metadata associated with the 2009 data. It would be wise to explore this for each dataset, though they are likely to all be the same for this particular source across each of the years.
APIs
The API Explorer tab lets us define a custom API or use a predefined Query URL. If we don’t think we need all the fields, we could limit what the API returns.
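As a rough sketch of limiting the returned fields, the query string can be built with an explicit `outFields` list instead of `*`. The base URL and layer id 33 (the 2009 data) are the ones recorded later in this post. The field names in `wanted_fields` are assumptions for illustration; check the dataset's metadata for the real ones.

```python
from urllib.parse import urlencode

base_url = "https://maps2.dcgis.dc.gov/dcgis/rest/services/FEEDS/MPD/MapServer"
layer_id = "33"  # 2009 incidents

# Hypothetical field names; confirm them against the dataset metadata.
wanted_fields = ["OFFENSE", "REPORT_DAT", "LATITUDE", "LONGITUDE"]

params = {
    "where": "1=1",                        # no row filter
    "outFields": ",".join(wanted_fields),  # only these columns instead of "*"
    "outSR": "4326",                       # WGS84 latitude/longitude
    "f": "json",
}

# urlencode percent-encodes the "=" and "," characters in the values
query_url = f"{base_url}/{layer_id}/query?{urlencode(params)}"
print(query_url)
```

Requesting fewer fields should shrink the response, which matters when pulling a full year of incidents.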
At the bottom, I can also click “Try it Out” to explore the structure of the output JSON file. This is a nice way to look at the structure. In my previous project, the Latitude and Longitude fields were in a very different format, specific to a unique reference system, that required a transformation. Now it looks like I won’t have to worry about that.
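To make the output structure concrete, here is a small sketch that flattens a response shaped like the general ArcGIS REST `f=json` layout, where each feature carries an `attributes` object and a `geometry` object. The sample below is made up for illustration and is not real incident data; the field name `OFFENSE` is an assumption.

```python
import json

# Made-up sample shaped like an ArcGIS REST query response (f=json).
sample_response = """
{
  "features": [
    {
      "attributes": {"OFFENSE": "THEFT/OTHER"},
      "geometry": {"x": -77.03, "y": 38.9}
    }
  ]
}
"""

data = json.loads(sample_response)

rows = []
for feature in data["features"]:
    row = dict(feature["attributes"])         # copy the attribute fields
    geometry = feature.get("geometry") or {}  # geometry may be absent
    row["lon"] = geometry.get("x")            # with outSR=4326, x is longitude
    row["lat"] = geometry.get("y")            # and y is latitude
    rows.append(row)

print(rows[0])
```

With `outSR=4326` in the query, the coordinates come back as plain longitude and latitude, so no reference-system transformation is needed.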
When I selected the API drop-down menu I also grabbed the URL for the GeoJSON file.
https://opendata.arcgis.com/datasets/73cd2f2858714cd1a7e2859f8e6e4de4_33.geojson
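The GeoJSON version uses the standard FeatureCollection layout, where each feature has `properties` and a `geometry` with coordinates in longitude, latitude order. The sketch below parses a made-up sample in that layout; the property name `OFFENSE` is an assumption and the values are not real data.

```python
import json

# Made-up sample in standard GeoJSON layout; not real incident data.
sample_geojson = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"OFFENSE": "BURGLARY"},
      "geometry": {"type": "Point", "coordinates": [-77.01, 38.91]}
    }
  ]
}
"""

collection = json.loads(sample_geojson)

points = []
for feature in collection["features"]:
    # GeoJSON point coordinates are ordered [longitude, latitude]
    lon, lat = feature["geometry"]["coordinates"][:2]
    points.append({"offense": feature["properties"]["OFFENSE"], "lat": lat, "lon": lon})

print(points)
```

The coordinate order is easy to get backwards, so it is worth naming the two values explicitly when unpacking.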
Download data
The page also offers a few different means to just download the data. In past projects, I have downloaded the Spreadsheet of the full dataset. There are pros and cons to API vs Download, but for the purpose of my posts, I’ll use the API.
Script the API URLs
Now that I have recorded each of the API URLs, I began compiling them into a script in R or another language. As you can see below, the structure of the URLs is not what you would expect given the chronology of years. Ideally, each data URL would have progressed from /33 to /44, which would have made the script much cleaner and wouldn’t have required visiting each API page.
# Assess the URLs and reduce to unique portions
# base_url + crime_data_YY + query_end_url
base_url <- 'https://maps2.dcgis.dc.gov/dcgis/rest/services/FEEDS/MPD/MapServer'
query_end_url <- '/query?where=1%3D1&outFields=*&outSR=4326&f=json'
crime_data_09 <- '/33'
crime_data_10 <- '/34'
crime_data_11 <- '/35'
crime_data_12 <- '/11'
crime_data_13 <- '/10'
crime_data_14 <- '/9'
crime_data_15 <- '/27'
crime_data_16 <- '/26'
crime_data_17 <- '/38'
crime_data_18 <- '/0'
crime_data_19 <- '/1'
crime_data_20 <- '/2'
crime_data_last30days <- '/8'
For the GeoJSON API version, I used the following.
base_url <- 'https://opendata.arcgis.com/datasets'
crime_data_geojson <- c(
'2009' = '73cd2f2858714cd1a7e2859f8e6e4de4_33.geojson',
'2010' = 'fdacfbdda7654e06a161352247d3a2f0_34.geojson',
'2011' = '9d5485ffae914c5f97047a7dd86e115b_35.geojson',
'2012' = '010ac88c55b1409bb67c9270c8fc18b5_11.geojson',
'2013' = '5fa2e43557f7484d89aac9e1e76158c9_10.geojson',
'2014' = '6eaf3e9713de44d3aa103622d51053b5_9.geojson',
'2015' = '35034fcb3b36499c84c94c069ab1a966_27.geojson',
'2016' = 'bda20763840448b58f8383bae800a843_26.geojson',
'2017' = '6af5cb8dc38e4bcbac8168b27ee104aa_38.geojson',
'2018' = '38ba41dd74354563bce28a359b59324e_0.geojson',
'2019' = 'f08294e5286141c293e9202fcd3e8b57_1.geojson',
'2020' = 'f516e0dd7b614b088ad781b0c4002331_2.geojson'
)
# Assess the URLs and reduce to unique portions
# base_url * "/" * layer id * "/" * query_end_url (the yearly ids below carry no slashes)
base_url = "https://maps2.dcgis.dc.gov/dcgis/rest/services/FEEDS/MPD/MapServer"
query_end_url = "query?where=1%3D1&outFields=*&outSR=4326&f=json"
crime_data_json = Dict(
"2009" => "33",
"2010" => "34",
"2011" => "35",
"2012" => "11",
"2013" => "10",
"2014" => "9",
"2015" => "27",
"2016" => "26",
"2017" => "38",
"2018" => "0",
"2019" => "1",
"2020" => "2"
)
# API URL to the last 30 days of criminal activity
crime_data_last30days = "/8"
Using the GeoJSON API, I structured the code a little differently, with a tuple of pairs instead of a dictionary.
base_url = "https://opendata.arcgis.com/datasets/"
crime_data_geojson_tup = (
"2009" => "73cd2f2858714cd1a7e2859f8e6e4de4_33.geojson",
"2010" => "fdacfbdda7654e06a161352247d3a2f0_34.geojson",
"2011" => "9d5485ffae914c5f97047a7dd86e115b_35.geojson",
"2012" => "010ac88c55b1409bb67c9270c8fc18b5_11.geojson",
"2013" => "5fa2e43557f7484d89aac9e1e76158c9_10.geojson",
"2014" => "6eaf3e9713de44d3aa103622d51053b5_9.geojson",
"2015" => "35034fcb3b36499c84c94c069ab1a966_27.geojson",
"2016" => "bda20763840448b58f8383bae800a843_26.geojson",
"2017" => "6af5cb8dc38e4bcbac8168b27ee104aa_38.geojson",
"2018" => "38ba41dd74354563bce28a359b59324e_0.geojson",
"2019" => "f08294e5286141c293e9202fcd3e8b57_1.geojson",
"2020" => "f516e0dd7b614b088ad781b0c4002331_2.geojson"
)
# Assess the URLs and reduce to unique portions
# base_url + "/" + doc + "/" + query_end_url (the yearly doc values below carry no slashes)
base_url = "https://maps2.dcgis.dc.gov/dcgis/rest/services/FEEDS/MPD/MapServer"
query_end_url = "query?where=1%3D1&outFields=*&outSR=4326&f=json"
crime_data_json = [
{"year" : "2009", "doc" : "33"},
{"year" : "2010", "doc" : "34"},
{"year" : "2011", "doc" : "35"},
{"year" : "2012", "doc" : "11"},
{"year" : "2013", "doc" : "10"},
{"year" : "2014", "doc" : "9"},
{"year" : "2015", "doc" : "27"},
{"year" : "2016", "doc" : "26"},
{"year" : "2017", "doc" : "38"},
{"year" : "2018", "doc" : "0"},
{"year" : "2019", "doc" : "1"},
{"year" : "2020", "doc" : "2"}
]
# API URL to the last 30 days of criminal activity
crime_data_last30days = "/8"
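One way to turn these pieces into full query URLs is sketched below. It assumes the parts are joined with explicit slashes, since the `doc` values and `query_end_url` carry none. The helper name `build_query_url` is hypothetical, and the list is shortened to three years to keep the example small.

```python
base_url = "https://maps2.dcgis.dc.gov/dcgis/rest/services/FEEDS/MPD/MapServer"
query_end_url = "query?where=1%3D1&outFields=*&outSR=4326&f=json"

# Shortened copy of the list above; the full list runs through 2020.
crime_data_json = [
    {"year": "2009", "doc": "33"},
    {"year": "2010", "doc": "34"},
    {"year": "2011", "doc": "35"},
]


def build_query_url(doc):
    """Join base URL, layer id, and query string with explicit slashes."""
    return f"{base_url}/{doc}/{query_end_url}"


# Map each year to its full query URL
query_urls = {entry["year"]: build_query_url(entry["doc"]) for entry in crime_data_json}

print(query_urls["2009"])
```

Building the URLs in one place avoids repeating the slash handling wherever a year is requested.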
Using the GeoJSON API, I used a list of dictionaries.
base_url = "https://opendata.arcgis.com/datasets/"
crime_data_geojson_dict = [
{"year" : "2009", "doc" : "73cd2f2858714cd1a7e2859f8e6e4de4_33.geojson"},
{"year" : "2010", "doc" : "fdacfbdda7654e06a161352247d3a2f0_34.geojson"},
{"year" : "2011", "doc" : "9d5485ffae914c5f97047a7dd86e115b_35.geojson"},
{"year" : "2012", "doc" : "010ac88c55b1409bb67c9270c8fc18b5_11.geojson"},
{"year" : "2013", "doc" : "5fa2e43557f7484d89aac9e1e76158c9_10.geojson"},
{"year" : "2014", "doc" : "6eaf3e9713de44d3aa103622d51053b5_9.geojson"},
{"year" : "2015", "doc" : "35034fcb3b36499c84c94c069ab1a966_27.geojson"},
{"year" : "2016", "doc" : "bda20763840448b58f8383bae800a843_26.geojson"},
{"year" : "2017", "doc" : "6af5cb8dc38e4bcbac8168b27ee104aa_38.geojson"},
{"year" : "2018", "doc" : "38ba41dd74354563bce28a359b59324e_0.geojson"},
{"year" : "2019", "doc" : "f08294e5286141c293e9202fcd3e8b57_1.geojson"},
{"year" : "2020", "doc" : "f516e0dd7b614b088ad781b0c4002331_2.geojson"}
]
# Each list element is a dictionary; access one by its index
crime_data_geojson_dict[0]
# To get the year or the doc value, reference that key on the given list element.
crime_data_geojson_dict[0]["year"]
crime_data_geojson_dict[0]["doc"]
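As a short sketch, the full GeoJSON URLs can be built by appending each `doc` value to `base_url`, which already ends in a slash here. In the values listed, the number after the underscore in each file name matches the MapServer layer id for the same year, so it can be pulled out as a cross-check. The names `geojson_urls` and `layer_ids` are mine, and the list is shortened to two years.

```python
base_url = "https://opendata.arcgis.com/datasets/"

# Shortened copy of the list above; the full list runs through 2020.
crime_data_geojson_dict = [
    {"year": "2009", "doc": "73cd2f2858714cd1a7e2859f8e6e4de4_33.geojson"},
    {"year": "2010", "doc": "fdacfbdda7654e06a161352247d3a2f0_34.geojson"},
]

# Full download URL per year
geojson_urls = {e["year"]: base_url + e["doc"] for e in crime_data_geojson_dict}

# Layer id embedded in the file name: text between the last "_" and ".geojson"
layer_ids = {
    e["year"]: e["doc"].rsplit("_", 1)[1].split(".")[0]
    for e in crime_data_geojson_dict
}

print(geojson_urls["2009"])
print(layer_ids)
```

If a layer id ever stops matching the REST layer for the same year, that is a hint one of the recorded URLs was copied incorrectly.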
Notice that I also grabbed the API URL for the last 30 days of incidents. This will come in handy later in the project.
The full and revised script will be provided on my GitHub. Follow the links below.
Planning Progress
I will refer back to my plan and update it with what we have so far. Using XMind’s icons, I put a task completion status next to the data I have. I marked the 2020 data as short of complete because the 2020 data is still incomplete.
In the next post, I’ll explore getting the next pieces of data.
GitHub Link
For the language referenced, refer to the appropriate “scripts_[language]” directory in the repo. Also, the scripts are not designed to be run blindly; they are meant for the user to explore and understand the processes. Feel free to modify them to best suit your needs.
Posts in Project Series
- Criminal Analysis: Planning
- Criminal Analysis: Data Search (part 0)
- Criminal Analysis: Data Search (part 1)
- Criminal Analysis: Data Search (part 2)
- Criminal Analysis: Data Search (part 3)
- Criminal Analysis: Data Storage
- Criminal Analysis: Data Storage (part 2)
- Criminal Analysis: Data Search (part 4)
- Derive a Star Schema By Example
- Criminal Analysis: Data Exploration
- Criminal Analysis: Data Exploration (part 1)
- Criminal Analysis: Data Exploration (part 2a)
- Criminal Analysis: Data Exploration (part 2b)
- Criminal Analysis: Data Storage (Part 3)