***This was created and valid in 2016. Data for this specific data project may no longer be current***
Surveying the Data
Data Preparation is a valuable stage that steps up the rest of the stages in our project. It allows you to survey the data and figure out what needs to be done. Its your opportunity to identify potential transformations and enhance the data in the next stage. Preparation work also provides you a means of understanding what to expect and what you can use.
Viewing the unique values within a variable gains you that understands. Putting the data in a aggregation can help to further understand how a field is distributed among the unique values. The last part will be utilized in the data exploration stage.
Overall Data Survey
Due to the difference in contents of the data between 2010 and 2011-2015, we will use the data_2011_2015
dim(data_2011_2015)
# [1] 180310 20
data.frame(colnames(data_2011_2015))
# colnames.data_2011_2015.
# 1 CCN
# 2 REPORTDATETIME
# 3 SHIFT
# 4 OFFENSE
# 5 METHOD
# 6 LASTMODIFIEDDATE
# 7 BLOCKSITEADDRESS
# 8 BLOCKXCOORD
# 9 BLOCKYCOORD
# 10 WARD
# 11 ANC
# 12 DISTRICT
# 13 PSA
# 14 NEIGHBORHOODCLUSTER
# 15 BUSINESSIMPROVEMENTDISTRICT
# 16 BLOCK_GROUP
# 17 CENSUS_TRACT
# 18 VOTING_PRECINCT
# 19 START_DATE
# 20 END_DATE
Criminal Activity Survey
This is an overview of a couple columns related to the criminal activity in the dataset. This helps demonstrate our survey of the values and fields we will use to analyze.
data.frame(Unique_value = unique(data_2011_2015$OFFENSE))
# Unique_value
# 1 SEX ABUSE
# 2 HOMICIDE
# 3 BURGLARY
# 4 MOTOR VEHICLE THEFT
# 5 ROBBERY
# 6 THEFT F/AUTO
# 7 ASSAULT W/DANGEROUS WEAPON
# 8 THEFT/OTHER
# 9 ARSON
data.frame(Unique_value = unique(data_2011_2015$SHIFT))
# Unique_value
# 1 MIDNIGHT
# 2 DAY
# 3 EVENING
Temporal Data Survey
Going back the planning section, we set a goal of exploring temporal relationships with the criminal activity. In order to do this we need to investigate what the date/time fields look. Our findings will help us determine if we need to create additional temporal groupings or which field should be used as the primary.
data.frame(Unique_value = head(unique(data_2011_2015$REPORTDATETIME)))
# Unique_value
# 1 3/1/2011 12:00:00 AM
# 2 1/10/2011 12:00:00 AM
# 3 1/10/2011 1:45:00 PM
# 4 1/22/2011 4:30:00 PM
# 5 1/21/2011 12:00:00 AM
# 6 9/23/2011 8:10:00 PM
data.frame(Unique_value = head(unique(data_2011_2015$START_DATE)))
# Unique_value
# 1 12/31/2006 9:00:00 PM
# 2 6/30/2007 3:30:00 AM
# 3 12/23/2010 10:00:00 AM
# 4 6/25/2010 4:00:00 PM
# 5 8/2/2010 5:00:00 PM
# 6 9/21/2011 7:45:00 PM
data.frame(Unique_value = head(unique(data_2011_2015$END_DATE)))
# Unique_value
# 1 1/1/2007 2:30:00 AM
# 2 6/30/2007 3:30:00 AM
# 3 1/10/2011 12:00:00 PM
# 4 6/25/2010 7:00:00 PM
# 5 8/2/2010 12:00:00 AM
# 6 9/23/2011 7:00:00 PM
Some useful temporal groupings are year, year-month, Month, day of week (Sunday – Saturday), week number, hour of day and AM/PM. This gives us some data enhancement tasks.
Spatial Data Survey
Now on to the spatial fields for exploring the relationships with the criminal activity.
data.frame(BLOCK_GROUP = head(unique(data_2011_2015$BLOCK_GROUP)))
# BLOCK_GROUP
# 1 005900 1
# 2 007808 1
# 3 004002 2
# 4 004100 3
# 5 007503 1
# 6 005301 2
data.frame(BLOCKSITEADDRESS = head(unique(data_2011_2015$BLOCKSITEADDRESS)))
# BLOCK_GROUP
# 1 400 - 499 BLOCK OF H STREET NW
# 2 200 - 299 BLOCK OF 54TH STREET NE
# 3 1800 - 1899 BLOCK OF VERNON STREET NW
# 4 24TH STREET NW AND CALIFORNIA STREET NW
# 5 2200 - 2254 BLOCK OF MOUNT VIEW PLACE SE
# 6 1700 - 1799 BLOCK OF CHURCH STREET NW
data.frame(Lon = head(unique(data_2011_2015$BLOCKYCOORD)),
Lat = head(unique(data_2011_2015$BLOCKXCOORD)))
# Lon Lat
# 1 136882.0 398476.0
# 2 136049.0 406620.0
# 3 138791.0 396290.0
# 4 138635.2 395438.1
# 5 132834.0 400949.0
# 6 138059.2 396524.0
As we can see in the last command, these coordinates are not in decimal degree format, nor are they in degree minute second format. We need to go back to the website attribute documentation to understand what format and spatial projection the data is in. This sets up our transformation task in order to get the coordinates into a Lat/Lon format.
Spatial Groupings
Next we will survey the values available to us in the spatial grouping fields. Each of these fields was described on the data overview page. With this information we can look for geographic information systems (GIS) data that may be available on the internet that helps to put the data values into context. We could even use the shapefile from the GIS data to help us aggregate and plot the data on a map within each appropriate spatial grouping.
A simple query of {“Washington DC”, “GIS”, “data”} gave us DC GIS Services website and District of Columbia Open Data website. Within each site we can do follow on searches to get the appropriate spatial grouping files. There may also be data we find in these sources that we never considered, like public safety features, transportation data, and health and human services data. The sources may not be necessary but it could add a lot more dimensions to our analysis. For instance, how far away do criminal activities occur from certain locations? This is where we could develop that answer.
This is an example of growing our data sources as we go, which will require us to repeat the previous steps on the data we decide to add. Getting the GIS data in this instance may or may not be necessary to achieve our project objectives, but it does provide us with options when it comes to conducting spatial analysis and visualizing the spatially referenced data appropriately. Some data will not need much preparation or any transformations, like this GIS data. It does require us to come up with the appropriate data dictionary rules if the data values are different though.
# Unique values of the WARD field
unique(data_2011_2015$WARD)
# [1] 2 7 1 8 5 6 4 3 NA
# Unique values of the Advisory Neighborhood Commission (ANC) field
unique(data_2011_2015$ANC)
# [1] "2C" "7C" "1C" "2D" "8A" "2B" "5E" "6D" "6E" "5C" "4D" "1B" "8E" "8C" "6C" "6A" "1A" "8B" "7F" "8D" "7E" "7D" "6B" "7B" "3E" "4C" "4B" "3F" "2E" "2A" "3C" "5B" "5D" "2F" "4A"
# [36] "1D" "3B" "3G" "5A" "3D"
# Unique values of the DISTRICT field
unique(data_2011_2015$DISTRICT)
# [1] "FIRST" "SIXTH" "THIRD" "SECOND" "SEVENTH" "FIFTH" "FOURTH" "NONE"
# Unique values of the Police Service Area (PSA) field
unique(data_2011_2015$PSA)
# [1] "102" "608" "303" "208" "701" "101" "501" "105" "503" "207" "407" "305" "505" "302" "502" "703" "301" "705" "506" "507" "707" "104" "604" "706" "605" "408" "602" "107" "603"
# [30] "601" "202" "409" "404" "406" "201" "206" "108" "403" "402" "204" "504" "103" "704" "307" "308" "607" "606" "702" "708" "205" "304" "405" "401" "203" "106" "306" "Unk" NA
unique(data_2011_2015$NEIGHBORHOODCLUSTER)
# [1] 8 31 1 28 6 21 9 22 18 3 2 38 39 25 7 33 37 34 30 26 32 11 19 10 4 5 17 15 24 23 14 35 13 20 12 36 NA 27 16 29
unique(data_2011_2015$CENSUS_TRACT)
# [1] 5900 7808 4002 4100 7503 5301 5800 3302 11000 4701 11100 10700 2201 4400 9102 6400 3100 8701 7304 4300 9804 10600 7901 3200 10400 4901 10500 7504 8301
# [30] 9907 9811 7401 9905 9902 9700 7605 2802 7803 6600 7703 9604 9901 1001 7809 2900 9603 2502 4600 9507 1100 7406 202 2801 8001 2101 4001 5600 1902
# [59] 501 3800 9301 4801 8803 10100 8402 7403 2001 2702 10800 3000 1901 4902 NA 400 2301 702 7604 8802 5002 7603 2501 9602 9801 8302 10200 5201 7404
# [88] 7806 5500 300 9810 100 9803 7804 802 9903 4802 3700 8904 4201 7000 9505 9504 1804 7409 7709 7601 1803 1500 7807 801 8410 1200 9501 6900 7502
# [117] 6202 502 8804 7100 9508 8200 6802 7408 2600 2302 2400 600 7707 10900 9904 8702 4702 9509 8002 9807 1702 9204 701 9302 6700 9601 6801 3900 1600
# [146] 2202 8903 7407 2701 3301 3400 9203 9000 5001 4202 3500 2002 6804 3600 7708 1002 9503 7200 1302 8100 9400 10300 901 9906 1402 201 6500 2102 9802
# [175] 9201 7903 1301 1401 902 7301
unique(data_2011_2015$VOTING_PRECINCT)
# [1] "Precinct 143" "Precinct 96" "Precinct 25" "Precinct 13" "Precinct 114" "Precinct 15" "Precinct 129" "Precinct 19" "Precinct 127" "Precinct 1" "Precinct 72"
# [12] "Precinct 17" "Precinct 55" "Precinct 22" "Precinct 37" "Precinct 75" "Precinct 120" "Precinct 122" "Precinct 83" "Precinct 81" "Precinct 38" "Precinct 123"
# [23] "Precinct 21" "Precinct 128" "Precinct 84" "Precinct 103" "Precinct 125" "Precinct 119" "Precinct 105" "Precinct 110" "Precinct 121" "Precinct 112" "Precinct 39"
# [34] "Precinct 99" "Precinct 89" "Precinct 101" "Precinct 108" "Precinct 31" "Precinct 94" "Precinct 42" "Precinct 142" "Precinct 102" "Precinct 47" "Precinct 65"
# [45] "Precinct 33" "Precinct 118" "Precinct 6" "Precinct 141" "Precinct 41" "Precinct 104" "Precinct 56" "Precinct 4" "Precinct 64" "Precinct 136" "Precinct 35"
# [56] "Precinct 71" "Precinct 68" "Precinct 137" "Precinct 76" "Precinct 82" "Precinct 116" "Precinct 53" "Precinct 2" "Precinct 59" "Precinct 14" "Precinct 24"
# [67] "Precinct 106" "Precinct 12" "Precinct 46" "Precinct 11" "Precinct 111" "Precinct 77" "Precinct 124" "Precinct 32" "Precinct 48" "Precinct 100" "Precinct 40"
# [78] "Precinct 16" "Precinct 117" "Precinct 93" "Precinct 57" "Precinct 98" "Precinct 5" "Precinct 97" "Precinct 3" "Precinct 7" "Precinct 18" "Precinct 36"
# [89] "Precinct 79" "Precinct 30" "Precinct 90" "Precinct 43" "Precinct 126" "Precinct 60" "Precinct 113" "Precinct 140" "Precinct 61" "Precinct 23" "Precinct 51"
# [100] "Precinct 95" "Precinct 10" "Precinct 44" "Precinct 87" "Precinct 115" "Precinct 107" "Precinct 130" "Precinct 26" "Precinct 45" "Precinct 78" "Precinct 91"
# [111] "Precinct 66" "Precinct 49" "Precinct 27" "Precinct 86" "Precinct 63" "Precinct 74" "Precinct 28" "Precinct 88" "Precinct 92" "Precinct 58" "Precinct 62"
# [122] "Precinct 132" "Precinct 135" "Precinct 20" "Precinct 139" "Precinct 133" "Precinct 134" "Precinct 73" "Precinct 54" "Precinct 80" "Precinct 29" "Precinct 67"
# [133] "Precinct 131" "Precinct 34" "Precinct 85" "Precinct 69" "Precinct 70" "Precinct 9" "Precinct 50" "Precinct 109" "Precinct 138" "Precinct 52" "Precinct 8"
# [144] ""
- Criminal Analysis: Data Storage (Part 3)
- Getting COVID-19 Data (Julia)
- Getting COVID-19 Data (Python)
- Getting COVID-19 Data (R)
- Criminal Analysis: Data Exploration (part 2b)
- Criminal Analysis: Data Exploration (part 2a)
- Criminal Analysis: Data Exploration (part 1)
- Criminal Analysis: Data Exploration
- Derive a Star Schema By Example
- Criminal Analysis: Data Search (part 4)