***This was created and valid in 2016. Data for this specific data project may no longer be current***
Data analysis is our investigation of the questions we set out to answer for our project during the planning stage. At this point we should have a good understanding of what our data contains and how its values are distributed. Some might consider what we did in the exploration stage to be analysis, but that can be subjective. We gauged what the data contained and learned about what was available to us in order to conduct our analysis. The line between the two can be gray. What I focus on during this stage is a more in-depth understanding of the relationships within the data.
Analysis can encompass many things. Your overall objectives will define what you actually do here. The list below is merely a collection of analytic functions that can be performed on the data.
- Statistical Analysis
- Numerical Analysis
- Analysis of alternatives
- Machine Learning
- Classification
- Modeling and Simulation
- Spatial Analysis
- Temporal Analysis
- Regression Analysis
- Network Analysis
- Logistic Analysis
- Operations Research (minimization and optimization)
This list is not all-encompassing, but aims to point out what you might do for your specific projects. There are plenty of resources out there for each type of analysis, and covering them is beyond the scope of this demonstration. In this example, we will explore the spatial and temporal relationships of the criminal activity.
During the exploration stage we identified what the various fields looked like and what we could do with them. In this example we will analyze them further to come up with more specific findings. The results would then support our project objectives and allow us to advise our audience. To keep this demonstration somewhat abbreviated, I will focus on a couple of in-depth looks. In reality we would be going down several rabbit holes until we get to the answers that support our objectives.
High-level Analysis
First, let's assess when each of the various criminal offenses occurs.
library(reshape2)
library(ggplot2)
library(ggthemes)
dcast(data = data_2011_2015,
formula = OFFENSE ~ REPORTDATETIME_hour,
fun.aggregate = length,
value.var = 'OFFENSE')
# OFFENSE 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# 1 ARSON 6 7 7 10 10 10 6 7 7 4 4 7 5 4 10 5 6 5 6 6 10 5 6 5
# 2 ASSAULT W/DANGEROUS WEAPON 623 657 644 557 455 291 183 164 191 287 315 337 410 436 437 430 495 593 582 651 700 749 732 805
# 3 BURGLARY 429 396 385 288 267 267 284 482 529 665 786 827 869 956 878 930 1009 1132 1124 1026 941 908 698 630
# 4 HOMICIDE 565 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 MOTOR VEHICLE THEFT 399 302 311 287 285 316 416 603 855 1025 990 937 840 835 816 756 761 699 646 641 590 545 507 460
# 6 ROBBERY 1009 997 883 832 712 476 331 269 339 394 470 558 639 719 725 753 821 852 958 1047 1229 1396 1284 1354
# 7 SEX ABUSE 88 55 54 57 59 45 55 51 45 30 44 57 49 50 70 57 52 68 56 58 66 65 76 66
# 8 THEFT F/AUTO 932 858 1094 1288 1022 858 1096 2253 3422 3564 3305 3336 3302 3266 2940 2877 2802 2448 2190 1889 1862 1749 1410 1372
# 9 THEFT/OTHER 1237 987 813 733 575 444 495 897 1755 2701 3278 3723 3929 4202 4325 4760 4972 4739 4643 4351 3859 3288 2267 1807
crimePlot_analysis1.data <- dcast(data = data_2011_2015,
formula = OFFENSE + REPORTDATETIME_hour ~ "Count",
fun.aggregate = length,
value.var = 'OFFENSE')
crimePlot_analysis1 <-
ggplot(data = crimePlot_analysis1.data, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) +
geom_line(size = 0.8, stat = 'identity') +
scale_color_tableau() +
labs(x = 'Hour of Day',
y = 'Number of Incidents',
title = 'Washington D.C.\nAmount of Criminal Activity By Hour',
color = 'Offense') +
theme(legend.position = 'top',
legend.background = element_rect(color = 'black'),
legend.key = element_rect(fill = 'white'),
legend.text = element_text(size = 5),
legend.title = element_text(size = 5),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm")) +
guides(col = guide_legend(ncol = 3, keyheight = 0.5, keywidth = 0.5))
# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis1.png', crimePlot_analysis1, width = 8, height = 5)
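The counting step above can also be sketched in base R, which may help if reshape2 is not installed. This is only a minimal illustration on a made-up `toy` data frame (not the real `data_2011_2015`); note that `table()` also returns a zero count for combinations that never occur.

```r
# Toy records standing in for the real incident data (values are made up).
toy <- data.frame(
  OFFENSE = c("ROBBERY", "ROBBERY", "BURGLARY", "ROBBERY", "BURGLARY"),
  REPORTDATETIME_hour = c("00", "00", "13", "13", "13"),
  stringsAsFactors = FALSE
)

# Wide form: one row per offense, one column per hour.
wide_counts <- table(toy$OFFENSE, toy$REPORTDATETIME_hour)

# Long form: one row per offense/hour pair, the shape ggplot() expects.
long_counts <- as.data.frame(wide_counts, responseName = "Count",
                             stringsAsFactors = FALSE)
names(long_counts)[1:2] <- c("OFFENSE", "REPORTDATETIME_hour")

print(wide_counts)
print(long_counts)
```

The wide table is handy for reading off values, while the long table is the one to hand to a plot.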
Next, we will look at the same information, but bring in the spatial groupings. We will start at a high-level like Ward.
crimePlot_analysis2.data <- dcast(data = data_2011_2015,
formula = WARD + OFFENSE + REPORTDATETIME_hour ~ "Count",
fun.aggregate = length,
value.var = 'OFFENSE')
# Remove the NA Ward values. Previously we saw that they accounted for a couple records, so not significant enough for our purposes right now.
crimePlot_analysis2.data <- na.omit(crimePlot_analysis2.data)
crimePlot_analysis2 <-
ggplot(data = crimePlot_analysis2.data, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) +
geom_line(size = 0.8, stat = 'identity') +
facet_grid(WARD ~ ., scales = "free_y") +
scale_color_tableau() +
labs(x = 'Hour of Day',
y = 'Number of Incidents',
title = 'Washington D.C.\nAmount of Criminal Activity By Hour and Ward',
color = 'Offense') +
theme(legend.position = 'top',
legend.background = element_rect(color = 'black'),
legend.key = element_rect(fill = 'white'),
legend.text = element_text(size = 6),
legend.title = element_text(size = 6),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm")) +
guides(col = guide_legend(ncol = 3, keyheight = 0.5, keywidth = 0.5))
# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis2.png', crimePlot_analysis2, width = 8, height = 8)
Here we can see how the activity fluctuates over time across each of the wards.
Focus and Scope
The “Theft/Other” and “Theft F/Auto” offenses seem to overpower the results from the other offenses, so let’s remove them to see how the plot looks without them. Throughout the analysis, we can apply filters to focus on specific offenses. We can quantify what’s been filtered out or partition the analysis by segmenting the volume of activity.
# Remove "Theft/Other" and "Theft F/Auto" offenses.
crimePlot_analysis2.data_subset <- crimePlot_analysis2.data[crimePlot_analysis2.data$OFFENSE != "THEFT/OTHER" & crimePlot_analysis2.data$OFFENSE != "THEFT F/AUTO", ]
crimePlot_analysis2_subset <-
ggplot(data = crimePlot_analysis2.data_subset, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) +
geom_line(size = 0.8, stat = 'identity') +
facet_grid(WARD ~ ., scales = "free_y") +
scale_color_tableau() +
labs(x = 'Hour of Day',
y = 'Number of Incidents',
title = 'Washington D.C.\nAmount of Criminal Activity By Hour and Ward',
color = 'Offense') +
theme(legend.position = 'top',
legend.background = element_rect(color = 'black'),
legend.key = element_rect(fill = 'white'),
legend.text = element_text(size = 6),
legend.title = element_text(size = 6),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm")) +
guides(col = guide_legend(nrow = 2, keyheight = 0.5, keywidth = 0.5))
# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis2_subset.png', crimePlot_analysis2_subset, width = 8, height = 8)
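To quantify what a filter like this removes, one option is to compute the share of incidents that fall in the dropped offenses. The sketch below uses a made-up `toy_counts` table; the counts and the `drop_offenses` name are illustrative assumptions, not values from the real data.

```r
# Toy offense totals (made-up numbers, not the real D.C. counts).
toy_counts <- data.frame(
  OFFENSE = c("THEFT/OTHER", "THEFT F/AUTO", "ROBBERY", "BURGLARY"),
  Count   = c(50, 30, 15, 5),
  stringsAsFactors = FALSE
)

# Offenses we want to set aside.
drop_offenses <- c("THEFT/OTHER", "THEFT F/AUTO")
is_dropped <- toy_counts$OFFENSE %in% drop_offenses

# Rows that remain after the filter.
kept <- toy_counts[!is_dropped, ]

# Share of all incidents that the filter removes.
share_removed <- sum(toy_counts$Count[is_dropped]) / sum(toy_counts$Count)
print(share_removed)
```

Reporting this share alongside the filtered plot keeps the audience aware of how much activity is out of view.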
Now we can see the other offenses much more easily. We could remove the next highest offense value or look at the plot with just one offense to help focus on individual activities across the Wards. The next graphic will show just one of the offenses to demonstrate how that would look.
# Keep only the "Assault W/Dangerous Weapon" offense.
crimePlot_analysis2.data_subset2 <- crimePlot_analysis2.data[crimePlot_analysis2.data$OFFENSE == "ASSAULT W/DANGEROUS WEAPON", ]
crimePlot_analysis2_subset2 <-
ggplot(data = crimePlot_analysis2.data_subset2, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) +
geom_line(size = 0.8, stat = 'identity') +
facet_grid(WARD ~ ., scales = "free_y") +
scale_color_tableau() +
labs(x = 'Hour of Day',
y = 'Number of Incidents',
title = 'Washington D.C.\nAmount of Criminal Activity By Hour and Ward\n[Assault with a Dangerous Weapon]',
color = 'Offense') +
theme(legend.position = "none",
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm"))
# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis2_subset2.png', crimePlot_analysis2_subset2, width = 8, height = 8)
In this plot we can see the life-cycle of “Assault with a Dangerous Weapon” across each of the Wards. Since we removed the other offenses, we could easily have made the year of the offense a color feature. These changes help to characterize the activity by providing additional context. This demonstrates the process of drilling down into the data to develop patterns and artifacts. Looking at the offenses this way helps us focus on one offense at a time. When all the offenses are displayed we can see the relative amount of activity. Focusing on one or two attributes at a time can help with the analysis and draw out follow-on questions and tests.
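As a rough sketch of the "year as a color feature" idea, the plot below maps a year column to color once a single offense is isolated. The `toy` data frame and its `Hour`, `Year` and `Count` columns are made up for illustration; with the real data, the counts would first need to be aggregated by hour and year.

```r
library(ggplot2)

# Toy hourly counts for one offense across two years (made-up values).
toy <- data.frame(
  Hour  = rep(c("00", "01", "02"), times = 2),
  Year  = rep(c(2014, 2015), each = 3),
  Count = c(5, 8, 6, 7, 9, 11)
)

# One line per year; color now carries the year instead of the offense.
year_plot <- ggplot(toy, aes(x = Hour, y = Count,
                             group = Year, color = factor(Year))) +
  geom_line() +
  labs(x = 'Hour of Day', y = 'Number of Incidents', color = 'Year')
```

Adding `facet_grid(WARD ~ .)` to a version built from the real data would keep the per-Ward layout used earlier.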
Analyzing the Data Spatially
In this section we will continue with the examples above but plot them on a map. This can further help us characterize and understand additional relationships that we may not get from the work above. When we move to a map, we add spatial context. We can drill down further into some of the other spatial groupings to provide more contextualized details. This enhances our characterizations and allows us to be more specific. Context and specifics can help influence scheduling of services and community monitoring of specific geographic areas during certain times.
High-level Assessment
In this first instance, we will go straight into Ward 2, looking at the offenses by year. This will provide us a high-level overview of the area over a high-level temporal grouping. This will be a starting point to further explore the activity within Ward 2.
library(rgdal)
library(rgeos)
library(maptools)
library(plyr)
Ward.df <- data_2011_2015[data_2011_2015$WARD == 2,]
Ward_2 <- dcast(data = Ward.df, formula = WARD + OFFENSE + REPORTDATETIME_yr ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_2)[3] <- 'Year'
maxValue <- max(Ward_2$Count)
minValue <- min(Ward_2$Count)
WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD_2 <- WashDC_WARD[WashDC_WARD$WARD == 2,]
WashDC_WARD_2.points <- fortify(WashDC_WARD_2, region = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.points, WashDC_WARD_2@data, by = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.df, Ward_2, by = "WARD")
WashDC_WARD_plot <- ggplot(WashDC_WARD_2.df, aes(long, lat, group = group, fill = Count)) +
geom_polygon() +
geom_path(color = 'white') +
coord_equal() +
facet_grid(OFFENSE ~ Year, drop = F)+
scale_fill_gradientn(colours=rev(x = rainbow(4)),
breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
na.value = 'black',
space = 'Lab',
guide = 'colourbar') +
labs(x = 'Longitude', y = 'Latitude',
title = 'Washington D.C.\nDensity of Criminal Activity From 2011-2015 By Ward 2',
fill = 'Total Activity') +
theme(panel.background = element_rect(fill = 'gray20'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text.y = element_text(angle = 0),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black', size = 5),
axis.text.x = element_text(angle = 90),
axis.ticks.length = unit(0, "cm"))
ggsave('./images/WashDC_WARD_2_plot.png', WashDC_WARD_plot, width = 10, height = 8)
As the graphic depicts, Theft/Other is the most prominent activity in each of the years. We could filter by offense(s) to isolate activity and reduce the volume of information. This isolation enables us to focus analytical efforts on more important issues.
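The `breaks` expression passed to `scale_fill_gradientn()` in the map code is easier to follow with a worked number. Assuming a hypothetical `maxValue` of 23 (not the real maximum), it expands as follows; note the repeated 0, which `unique()` is one way to drop.

```r
# Hypothetical maximum count, chosen only to illustrate the arithmetic.
maxValue <- 23

# Step size: roughly a fifth of the range, rounded up.
step <- ceiling(maxValue / 5)

# Same expression as in the plotting code.
raw_breaks <- c(0, seq(0, maxValue, step), maxValue)

# Drop the repeated value so each break appears once.
clean_breaks <- unique(raw_breaks)

print(raw_breaks)    # 0 0 5 10 15 20 23
print(clean_breaks)  # 0 5 10 15 20 23
```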
This next one focuses on assault with a dangerous weapon. We will look at this offense in more temporal granularity (hour of day by year). This specificity should give us some answers regarding general workforce scheduling of law enforcement officials. It can also help us assess whether particular activities are changing. Changes year to year may be traced back to policy or efforts that are having measurable effects.
Ward.df <- data_2011_2015[data_2011_2015$WARD == 2 & data_2011_2015$OFFENSE == "ASSAULT W/DANGEROUS WEAPON",]
Ward_2_adw <- dcast(data = Ward.df, formula = WARD + REPORTDATETIME_yr + REPORTDATETIME_hour ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_2_adw)[2] <- 'Year'
colnames(Ward_2_adw)[3] <- 'Hour'
maxValue <- max(Ward_2_adw$Count)
minValue <- min(Ward_2_adw$Count)
WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD_2 <- WashDC_WARD[WashDC_WARD$WARD == 2,]
WashDC_WARD_2.points <- fortify(WashDC_WARD_2, region = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.points, WashDC_WARD_2@data, by = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.df, Ward_2_adw, by = "WARD")
WashDC_WARD_2_adw_plot <- ggplot(WashDC_WARD_2.df, aes(long, lat, group = group, fill = Count)) +
geom_polygon() +
geom_path(color = 'white') +
coord_equal() +
facet_grid(Year ~ Hour, drop = F)+
scale_fill_gradientn(colours=rev(x = rainbow(4)),
breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
na.value = 'black',
space = 'Lab',
guide = 'colourbar') +
labs(x = 'Longitude', y = 'Latitude',
title = 'Washington D.C.\nDensity of Assault With a Dangerous Weapon Activity Within Ward 2 From 2011-2015 By Hour and Year',
fill = 'Total Activity') +
theme(panel.background = element_rect(fill = 'gray20'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text.y = element_text(angle = 0),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black', size = 5),
axis.text.x = element_text(angle = 90),
axis.ticks.length = unit(0, "cm"))
ggsave('./images/WashDC_WARD_2_adw_plot.png', WashDC_WARD_2_adw_plot, width = 14, height = 6)
In the graphic, we can see the variations for each hour of each year. Most of the activity occurs between 1700 and 0500 hours, with the peak around 0200 and 0300. Over the years, we can see how much the activity has fluctuated. By this view, 2015 has some of the highest counts across the hours. Now we can drill into the more granular details about the incidents to see exactly where the activity is occurring. We will also filter the data to just look at 2015. With smaller spatial groupings we can potentially isolate the problem areas and tag them as areas of interest.
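Reading peaks off a faceted map can be backed up with a quick table. The sketch below picks the busiest hour per year from a made-up `toy` table shaped like `Ward_2_adw` (columns `Year`, `Hour`, `Count`); the numbers are illustrative only.

```r
# Toy hour-by-year counts (made-up values).
toy <- data.frame(
  Year  = c(2014, 2014, 2014, 2015, 2015, 2015),
  Hour  = c("01", "02", "03", "01", "02", "03"),
  Count = c(4, 9, 7, 5, 6, 11),
  stringsAsFactors = FALSE
)

# For each year, keep the row with the highest count.
peak_by_year <- do.call(
  rbind,
  lapply(split(toy, toy$Year), function(d) d[which.max(d$Count), ])
)

print(peak_by_year)
```

`which.max()` returns only the first maximum, so ties between hours would need separate handling.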
Enhancing Spatial Context
The following will overlay the incidents onto the Ward 2 spatial grouping. This could easily be applied to the whole city. By focusing on one Ward, we can test our hypothesis on a smaller portion of the city. When we determine the best method to analyze and answer our question, we can scale back up.
# Filter the data specific to our interest
Ward.df <- data_2011_2015[data_2011_2015$WARD == 2 &
data_2011_2015$OFFENSE == "ASSAULT W/DANGEROUS WEAPON", ]
# Aggregate the data by location and year
Ward_2_adw <- dcast(data = Ward.df, formula = BLOCKXCOORD + BLOCKYCOORD + REPORTDATETIME_yr ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_2_adw)[3] <- "Year"
# Sort the data by Count
Ward_2_adw <- Ward_2_adw[with(Ward_2_adw, order(Count)),]
#Convert Maryland State Plane coordinates provided in the data
nad83_coords <- data.frame(x=Ward_2_adw$BLOCKXCOORD, y=Ward_2_adw$BLOCKYCOORD)
coordinates(nad83_coords) <- c('x', 'y')
proj4string(nad83_coords)<-CRS("+init=esri:102285")
ConvertedCoords<-spTransform(nad83_coords,CRS("+init=epsg:4326"))
ConvertedCoords<-as.data.frame(ConvertedCoords)
colnames(ConvertedCoords)<-c('long','lat')
Ward_2_adw <- cbind(Ward_2_adw, ConvertedCoords)
# Define the Scale Range (max and min)
maxValue <- max(Ward_2_adw$Count)
minValue <- min(Ward_2_adw$Count)
# Read in the Ward data and filter it to just Ward 2
WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD_2 <- WashDC_WARD[WashDC_WARD$WARD == 2,]
WashDC_WARD_2.points <- fortify(WashDC_WARD_2, region = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.points, WashDC_WARD_2@data, by = "id")
Ward2_by_year <- ggplot(WashDC_WARD_2.df, aes(long, lat)) +
geom_polygon(fill = 'gray', alpha = .75) +
geom_path(color = 'white') +
coord_equal() +
geom_point(data = Ward_2_adw, aes(x = long, y = lat, color = Count)) +
scale_color_gradientn(colours=rev(x = rainbow(4)),
breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
na.value = 'black',
space = 'Lab',
guide = 'colourbar') +
facet_wrap(facets = ~Year) +
labs(x = 'Longitude', y = 'Latitude',
title = 'Washington D.C.\nDensity of Assault With a Dangerous Weapon Activity Within Ward 2 From 2011-2015 By Year',
color = 'Total Activity') +
theme(panel.background = element_rect(fill = 'gray20'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text.y = element_text(angle = 0),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm"))
ggsave('./images/Ward2_by_year_plot.png', Ward2_by_year, width = 9, height = 8)
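The coordinate conversion in the code above relies on sp and rgdal. For readers on a newer setup, the same step might be sketched with the sf package instead. This is an assumption-laden illustration rather than the method used here: the block coordinates are made up, and it assumes the installed PROJ database recognises the `"ESRI:102285"` code that mirrors the projection string above.

```r
library(sf)

# Hypothetical block coordinates in metres (made up, roughly central D.C.).
blocks <- data.frame(BLOCKXCOORD = c(397000, 398500),
                     BLOCKYCOORD = c(137000, 138200))

# Assumption: "ESRI:102285" is the same Maryland State Plane system as above.
pts <- st_as_sf(blocks, coords = c("BLOCKXCOORD", "BLOCKYCOORD"),
                crs = "ESRI:102285")

# Reproject to longitude/latitude (WGS84, EPSG:4326).
pts_wgs84 <- st_transform(pts, crs = 4326)

# Pull the coordinates back out as plain columns for plotting.
lonlat <- as.data.frame(st_coordinates(pts_wgs84))
colnames(lonlat) <- c("long", "lat")
print(lonlat)
```

The resulting `long` and `lat` columns can be bound back onto the aggregated counts in the same way as `ConvertedCoords`.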
We can see in the graphic that the activity is concentrated in particular areas, with most of it occurring around the northeastern portion. Now let's overlay all the activity data throughout the city, still filtered for just assault with a dangerous weapon.
# Filter the data specific to our interest
adw.df <- data_2011_2015[data_2011_2015$OFFENSE == "ASSAULT W/DANGEROUS WEAPON", ]
# Aggregate the data by location and year
Ward_adw <- dcast(data = adw.df, formula = WARD + BLOCKXCOORD + BLOCKYCOORD + REPORTDATETIME_yr ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_adw)[4] <- "Year"
# Sort the data by Count
Ward_adw <- Ward_adw[with(Ward_adw, order(WARD, Count)),]
#Convert Maryland State Plane coordinates provided in the data
nad83_coords <- data.frame(x=Ward_adw$BLOCKXCOORD, y=Ward_adw$BLOCKYCOORD)
coordinates(nad83_coords) <- c('x', 'y')
proj4string(nad83_coords)<-CRS("+init=esri:102285")
ConvertedCoords<-spTransform(nad83_coords,CRS("+init=epsg:4326"))
ConvertedCoords<-as.data.frame(ConvertedCoords)
colnames(ConvertedCoords)<-c('long','lat')
Ward_adw <- cbind(Ward_adw, ConvertedCoords)
# Define the Scale Range (max and min)
maxValue <- max(Ward_adw$Count)
minValue <- min(Ward_adw$Count)
# Get the years to facet the map by
Ward_adw_years <- data.frame(WARD = unique(Ward_adw$WARD))
Ward_adw_years <- merge(Ward_adw_years, 2011:2015)
# Read in the Ward data (all Wards are kept this time)
WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD <- WashDC_WARD[WashDC_WARD$WARD,]
WashDC_WARD.points <- fortify(WashDC_WARD, region = "id")
WashDC_WARD.df <- join(WashDC_WARD.points, WashDC_WARD@data, by = "id")
All_Wards_by_year <- ggplot(WashDC_WARD.df, aes(long, lat)) +
geom_polygon(fill = 'gray', alpha = .75) +
geom_path(color = 'white') +
geom_point(data = Ward_adw, aes(x = long, y = lat, color = Count)) +
coord_equal() +
scale_color_gradientn(colours=rev(x = rainbow(4)),
breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
na.value = 'black',
space = 'Lab',
guide = 'colourbar') +
facet_wrap(facets = ~Year) +
labs(x = 'Longitude', y = 'Latitude',
title = 'Washington D.C.\nDensity of Assault With a Dangerous Weapon Activity From 2011-2015 By Year',
color = 'Total Activity') +
theme(panel.background = element_rect(fill = 'gray20'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text.y = element_text(angle = 0),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm"))
ggsave('./images/All_Wards_by_year_plot.png', All_Wards_by_year, width = 9, height = 8)
We can assess the concentration of activity over the years. It's also helpful to assess whether areas are improving, getting worse or staying roughly the same.
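One simple way to label areas as improving, worse or unchanged is to compare counts between two years. The sketch below does this per Ward on a made-up `toy` table; the counts, and the choice to compare only 2014 with 2015, are assumptions made for illustration.

```r
# Toy yearly counts per Ward (made-up values).
toy <- data.frame(
  WARD  = c(1, 1, 2, 2, 3, 3),
  Year  = c(2014, 2015, 2014, 2015, 2014, 2015),
  Count = c(10, 14, 20, 12, 7, 7)
)

# Wards as rows, years as columns.
wide <- xtabs(Count ~ WARD + Year, data = toy)

# Change from 2014 to 2015 for each Ward.
change <- wide[, "2015"] - wide[, "2014"]

# More incidents = worse, fewer = improving, no change = same.
trend <- ifelse(change > 0, "worse",
                ifelse(change < 0, "improving", "same"))

print(trend)
```

A raw difference ignores how large each Ward's baseline is, so a percentage change may be a fairer comparison when counts differ widely.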
Conclusion
There are a multitude of ways we can slice and dice the data and investigate where it leads us. This is where having objectives to accomplish comes into the picture. Objectives give us boundaries within which to prove or disprove our hypotheses. When we have accomplished what we set out to get out of the project and have discovered follow-on objectives, we now have future work to pursue after we finish our initial project. If the new objectives help bolster our project, then they can be worth incorporating into the effort. Beware of mission creep, though.
Once our analysis has produced results and we are confident the data appropriately responds to our hypotheses, we can move on to producing visualizations that help convey the evidence for or against our objectives.