Project Workflow: Data Analysis - Problem(x) Solutions

***This was created and valid in 2016. Data for this specific data project may no longer be current***

Data analysis is our investigation of the questions we set out to answer for our project during the planning stage. At this point we should have a good understanding of what our data contains and how its values are distributed. Some might consider what we did in the exploration stage to be analysis, but that can be subjective. We gauged what the data contained and learned what was available to us in order to conduct our analysis. The line between the two can be gray. What I focus on during this stage is a more in-depth understanding of the relationships within the data.

Analysis can encompass many things. Your overall objectives will define what you actually do here. The list below is merely a collection of analytic functions that can be performed on the data.

Statistical Analysis
Numerical Analysis
Analysis of alternatives
Machine Learning
Classification
Modeling and Simulation
Spatial Analysis
Temporal Analysis
Regression Analysis
Network Analysis
Logistic Analysis
Operations Research (minimization and optimization)

This list is not all-encompassing, but aims to point out what you might do for your specific projects. There are plenty of resources out there for each type of analysis, and covering them is beyond the scope of this demonstration. In this example, we will explore the spatial and temporal relationships of the criminal activity.

During the exploration stage we identified what the various fields looked like and what we could do with them. In this example we will analyze them further to come up with more specific findings. The results would then support our project objectives and allow us to advise our audience. To keep this demonstration somewhat abbreviated, I will focus on a couple of in-depth looks. In reality we would go down several rabbit holes until we reached the answers that support our objectives.

High-level Analysis

First, let's assess when each of the various criminal offenses occurs.

library(reshape2)
library(ggplot2)
library(ggthemes)

dcast(data = data_2011_2015,  
      formula = OFFENSE ~ REPORTDATETIME_hour, 
      fun.aggregate = length, 
      value.var = 'OFFENSE')
#                      OFFENSE   00  01   02   03   04  05   06   07   08   09   10   11   12   13   14   15   16   17   18   19   20   21   22   23
# 1                      ARSON    6   7    7   10   10  10    6    7    7    4    4    7    5    4   10    5    6    5    6    6   10    5    6    5
# 2 ASSAULT W/DANGEROUS WEAPON  623 657  644  557  455 291  183  164  191  287  315  337  410  436  437  430  495  593  582  651  700  749  732  805
# 3                   BURGLARY  429 396  385  288  267 267  284  482  529  665  786  827  869  956  878  930 1009 1132 1124 1026  941  908  698  630
# 4                   HOMICIDE  565   0    0    0    0   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
# 5        MOTOR VEHICLE THEFT  399 302  311  287  285 316  416  603  855 1025  990  937  840  835  816  756  761  699  646  641  590  545  507  460
# 6                    ROBBERY 1009 997  883  832  712 476  331  269  339  394  470  558  639  719  725  753  821  852  958 1047 1229 1396 1284 1354
# 7                  SEX ABUSE   88  55   54   57   59  45   55   51   45   30   44   57   49   50   70   57   52   68   56   58   66   65   76   66
# 8               THEFT F/AUTO  932 858 1094 1288 1022 858 1096 2253 3422 3564 3305 3336 3302 3266 2940 2877 2802 2448 2190 1889 1862 1749 1410 1372
# 9                THEFT/OTHER 1237 987  813  733  575 444  495  897 1755 2701 3278 3723 3929 4202 4325 4760 4972 4739 4643 4351 3859 3288 2267 1807


crimePlot_analysis1.data <- dcast(data = data_2011_2015,  
                                  formula = OFFENSE + REPORTDATETIME_hour ~ "Count", 
                                  fun.aggregate = length, 
                                  value.var = 'OFFENSE')

crimePlot_analysis1 <- 
  ggplot(data = crimePlot_analysis1.data, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) + 
  geom_line(size = 0.8, stat = 'identity') + 
  scale_color_tableau() + 
  labs(x = 'Hour of Day', 
       y = 'Number of Incidents', 
       title = 'Washington D.C.\nAmount of Criminal Activity By Hour', 
       color = 'Offense') +
  theme(legend.position = 'top', 
        legend.background = element_rect(color = 'black'),              
        legend.key = element_rect(fill = 'white'),
        legend.text = element_text(size = 5),
        legend.title = element_text(size = 5),
        panel.border = element_rect(linetype = 'solid', fill = NA), 
        panel.spacing = unit(0.2, 'lines'), 
        strip.text = element_text(), 
        strip.background = element_rect(linetype = 'solid', color = 'black'), 
        axis.text = element_text(color = 'black'), 
        axis.ticks.length = unit(0, "cm")) +
  guides(col = guide_legend(ncol = 3, keyheight = 0.5, keywidth = 0.5))

# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis1.png', crimePlot_analysis1, width = 8, height = 5)
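
The casting step above counts records per offense and hour. As a minimal sketch of the same idea using only base R, the example below builds both the wide and the long form from a small made-up data frame. The `toy` values are hypothetical and are not taken from the D.C. data; only the column names mirror the ones used above.

```r
# Hypothetical toy records; column names mirror the ones used above.
toy <- data.frame(
  OFFENSE = c("ROBBERY", "ROBBERY", "BURGLARY", "ROBBERY", "BURGLARY"),
  REPORTDATETIME_hour = c("00", "00", "13", "23", "13"),
  stringsAsFactors = FALSE
)

# Wide form: one row per offense, one column per hour
# (comparable to the first dcast call).
wide_counts <- table(toy$OFFENSE, toy$REPORTDATETIME_hour)

# Long form: one row per offense/hour pair with a Count column
# (comparable to the second dcast call that feeds ggplot).
long_counts <- as.data.frame(wide_counts, stringsAsFactors = FALSE)
colnames(long_counts) <- c("OFFENSE", "REPORTDATETIME_hour", "Count")

print(wide_counts)
print(long_counts)
```

One difference worth knowing: the long form built this way keeps offense/hour pairs with a count of zero, which can be convenient when drawing continuous lines.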

Next, we will look at the same information, but bring in the spatial groupings. We will start with a high-level grouping such as Ward.

crimePlot_analysis2.data <- dcast(data = data_2011_2015,  
                                  formula = WARD + OFFENSE + REPORTDATETIME_hour ~ "Count", 
                                  fun.aggregate = length, 
                                  value.var = 'OFFENSE')

# Remove the NA Ward values.  Previously we saw that they accounted for a couple records, so not significant enough for our purposes right now.
crimePlot_analysis2.data <- na.omit(crimePlot_analysis2.data)

crimePlot_analysis2 <- 
  ggplot(data = crimePlot_analysis2.data, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) + 
  geom_line(size = 0.8, stat = 'identity') + 
  facet_grid(WARD ~ ., scales = "free_y") +
  scale_color_tableau() + 
  labs(x = 'Hour of Day', 
       y = 'Number of Incidents', 
       title = 'Washington D.C.\nAmount of Criminal Activity By Hour and Ward', 
       color = 'Offense') +
  theme(legend.position = 'top', 
        legend.background = element_rect(color = 'black'),              
        legend.key = element_rect(fill = 'white'),
        legend.text = element_text(size = 6),
        legend.title = element_text(size = 6),
        panel.border = element_rect(linetype = 'solid', fill = NA), 
        panel.spacing = unit(0.2, 'lines'), 
        strip.text = element_text(), 
        strip.background = element_rect(linetype = 'solid', color = 'black'), 
        axis.text = element_text(color = 'black'), 
        axis.ticks.length = unit(0, "cm")) +
  guides(col = guide_legend(ncol = 3, keyheight = 0.5, keywidth = 0.5))

# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis2.png', crimePlot_analysis2, width = 8, height = 8)

Here we can see how the activity fluctuates over time across each of the wards.

Focus and Scope

The “Theft/Other” and “Theft F/Auto” offenses seem to overpower the results from the other offenses, so let’s remove them to see how the plot looks without them. Throughout the analysis, we can apply filters to focus on specific offenses. We can quantify what has been filtered out, or partition the analysis by segmenting on volume of activity.

# Remove "Theft/Other" and "Theft F/Auto" offenses.
crimePlot_analysis2.data_subset <- crimePlot_analysis2.data[crimePlot_analysis2.data$OFFENSE != "THEFT/OTHER" & crimePlot_analysis2.data$OFFENSE != "THEFT F/AUTO", ]

crimePlot_analysis2_subset <- 
  ggplot(data = crimePlot_analysis2.data_subset, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) + 
  geom_line(size = 0.8, stat = 'identity') + 
  facet_grid(WARD ~ ., scales = "free_y") +
  scale_color_tableau() + 
  labs(x = 'Hour of Day', 
       y = 'Number of Incidents', 
       title = 'Washington D.C.\nAmount of Criminal Activity By Hour and Ward', 
       color = 'Offense') +
  theme(legend.position = 'top', 
        legend.background = element_rect(color = 'black'),              
        legend.key = element_rect(fill = 'white'),
        legend.text = element_text(size = 6),
        legend.title = element_text(size = 6),
        panel.border = element_rect(linetype = 'solid', fill = NA), 
        panel.spacing = unit(0.2, 'lines'), 
        strip.text = element_text(), 
        strip.background = element_rect(linetype = 'solid', color = 'black'), 
        axis.text = element_text(color = 'black'), 
        axis.ticks.length = unit(0, "cm")) +
  guides(col = guide_legend(nrow = 2, keyheight = 0.5, keywidth = 0.5))

# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis2_subset.png', crimePlot_analysis2_subset, width = 8, height = 8)
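
The filter above hides two offenses from the plot. One way to quantify what was filtered out is to compute the share of incidents those offenses represent before dropping them. This is a sketch on made-up numbers; the `counts` data frame and its values are hypothetical.

```r
# Hypothetical aggregated counts; the real values come from the dcast step.
counts <- data.frame(
  OFFENSE = c("THEFT/OTHER", "THEFT F/AUTO", "ROBBERY", "BURGLARY"),
  Count   = c(50, 30, 15, 5),
  stringsAsFactors = FALSE
)

removed <- c("THEFT/OTHER", "THEFT F/AUTO")
is_removed <- counts$OFFENSE %in% removed

# Share of all incidents that the filter drops from view.
share_removed <- sum(counts$Count[is_removed]) / sum(counts$Count)

# The same kind of filter as above, written with %in% instead of chained != tests.
counts_subset <- counts[!is_removed, ]

print(share_removed)
print(counts_subset)
```

Reporting this share alongside the filtered plot tells the audience how much activity sits outside the view.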

Now we can see the other offenses much more easily. We could remove the next highest offense or look at the plot with just one offense to help focus on individual activities across the Wards. The next graphic shows just one of the offenses to demonstrate how that would look.

# Keep only the "Assault W/Dangerous Weapon" offense.
crimePlot_analysis2.data_subset2 <- crimePlot_analysis2.data[crimePlot_analysis2.data$OFFENSE == "ASSAULT W/DANGEROUS WEAPON", ]

crimePlot_analysis2_subset2 <- 
  ggplot(data = crimePlot_analysis2.data_subset2, aes(x = REPORTDATETIME_hour, y =Count, group = OFFENSE, color = factor(OFFENSE))) + 
  geom_line(size = 0.8, stat = 'identity') + 
  facet_grid(WARD ~ ., scales = "free_y") +
  scale_color_tableau() + 
  labs(x = 'Hour of Day', 
       y = 'Number of Incidents', 
       title = 'Washington D.C.\nAmount of Criminal Activity By Hour and Ward\n[Assault with a Dangerous Weapon]', 
       color = 'Offense') +
  theme(legend.position = "none",
        panel.border = element_rect(linetype = 'solid', fill = NA), 
        panel.spacing = unit(0.2, 'lines'), 
        strip.text = element_text(), 
        strip.background = element_rect(linetype = 'solid', color = 'black'), 
        axis.text = element_text(color = 'black'), 
        axis.ticks.length = unit(0, "cm"))

# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot_analysis2_subset2.png', crimePlot_analysis2_subset2, width = 8, height = 8)

In this plot we can see the daily cycle of “Assault with a Dangerous Weapon” across each of the Wards. Since we removed the other offenses, we could easily have mapped the year of the offense to color instead. These changes help to characterize the activity by providing additional context. This demonstrates the process of drilling down into the data to uncover patterns and artifacts. Looking at the offenses this way helps us focus on one offense at a time, whereas displaying all the offenses together shows the relative amount of activity. Focusing on one or two attributes at a time can help with the analysis and draw out follow-on questions and tests.
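
Drilling down to one offense also makes simple summaries easy to compute. As an illustrative sketch, the example below finds the hour with the most incidents in each ward. The `adw` data frame and its values are made up for the example and are not results from the D.C. data.

```r
# Hypothetical hourly counts for one offense in two wards.
adw <- data.frame(
  WARD  = c(1, 1, 1, 2, 2, 2),
  Hour  = c(1, 2, 22, 1, 2, 22),
  Count = c(10, 25, 18, 7, 5, 12)
)

# Split by ward, then keep the row with the highest count in each piece.
peak_by_ward <- do.call(rbind, lapply(split(adw, adw$WARD), function(d) {
  d[which.max(d$Count), ]
}))

print(peak_by_ward)
```

A table like this can sit next to the faceted plot as a compact answer to "when is each ward busiest?".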

Analyzing the Data Spatially

In this section we will continue with the examples above but plot them on a map. This can further help us characterize and understand additional relationships that we may not get from the work above. When we move to a map, we add spatial context. We can drill down further into some of the other spatial groupings to provide more contextualized details. This enhances our characterizations and allows us to be more specific. Context and specifics can help influence scheduling of services and community monitoring of specific geographic areas during certain times.

High-level Assessment

In this first instance, we will go straight into Ward 2, looking at the offenses by year. This will provide us a high-level overview of the area over a high-level temporal grouping. This will be a starting point to further explore the activity within Ward 2.

library(rgdal)
library(rgeos)
library(maptools)
library(plyr)
Ward.df <- data_2011_2015[data_2011_2015$WARD == 2,]

Ward_2 <- dcast(data = Ward.df, formula = WARD + OFFENSE + REPORTDATETIME_yr ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_2)[3] <- 'Year'
maxValue <- max(Ward_2$Count)
minValue <- min(Ward_2$Count)

WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD_2 <- WashDC_WARD[WashDC_WARD$WARD == 2,]

WashDC_WARD_2.points <- fortify(WashDC_WARD_2, region = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.points, WashDC_WARD_2@data, by = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.df, Ward_2, by = "WARD")

WashDC_WARD_plot <- ggplot(WashDC_WARD_2.df, aes(long, lat, group = group, fill = Count)) +
  geom_polygon() +
  geom_path(color = 'white') +
  coord_equal() +
  facet_grid(OFFENSE ~ Year, drop = F)+
  scale_fill_gradientn(colours=rev(x = rainbow(4)),
                       breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
                       na.value = 'black',
                       space = 'Lab',
                       guide = 'colourbar') +
  labs(x = 'Longitude', y = 'Latitude', 
       title = 'Washington D.C.\nDensity of Criminal Activity Within Ward 2 From 2011-2015', 
       fill = 'Total Activity') +
  theme(panel.background  = element_rect(fill = 'gray20'), 
        panel.border  = element_rect(linetype  = 'solid', fill = NA), 
        panel.spacing  = unit(0.2, 'lines'), 
        strip.text.y = element_text(angle = 0), 
        strip.background  = element_rect(linetype = 'solid', color = 'black'), 
        axis.text  = element_text(color = 'black', size = 5),
        axis.text.x = element_text(angle = 90),
        axis.ticks.length  = unit(0, "cm"))

ggsave('./images/WashDC_WARD_2_plot.png', WashDC_WARD_plot, width  = 10, height  = 8)
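
The legend breaks in the plot above are derived from `maxValue`. As a small sketch, and not the code used to produce the graphic, the following shows one way to build evenly spaced breaks that always include the maximum without repeating any value. The `maxValue` used here is a made-up number.

```r
# Hypothetical maximum; in the code above maxValue comes from max(Ward_2$Count).
maxValue <- 23

step <- ceiling(maxValue / 5)

# seq() already starts at 0, so only the maximum needs to be appended;
# unique() drops it again when the sequence happens to end on maxValue.
legend_breaks <- unique(c(seq(0, maxValue, by = step), maxValue))

print(legend_breaks)
```

The resulting vector could be passed as the `breaks` argument of the fill or color scale.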

As the graphic depicts, Theft/Other is the most prominent activity in each of the years. We could filter by offense(s) to isolate activity and reduce the volume of information. This isolation enables us to focus analytical efforts on more important issues.

This next one focuses on assault with a dangerous weapon. We will look at this offense in more temporal granularity (hour of day by year). This specificity should give us some answers regarding general workforce scheduling of law enforcement officials. It can also help us assess whether particular activities are changing. Changes year to year may be traced back to policy or efforts that are having measurable effects.

Ward.df <- data_2011_2015[data_2011_2015$WARD == 2 & data_2011_2015$OFFENSE == "ASSAULT W/DANGEROUS WEAPON",]
Ward_2_adw <- dcast(data = Ward.df, formula = WARD + REPORTDATETIME_yr + REPORTDATETIME_hour ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_2_adw)[2] <- 'Year'
colnames(Ward_2_adw)[3] <- 'Hour'

maxValue <- max(Ward_2_adw$Count)
minValue <- min(Ward_2_adw$Count)

WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD_2 <- WashDC_WARD[WashDC_WARD$WARD == 2,]

WashDC_WARD_2.points <- fortify(WashDC_WARD_2, region = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.points, WashDC_WARD_2@data, by = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.df, Ward_2_adw, by = "WARD")

WashDC_WARD_2_adw_plot <- ggplot(WashDC_WARD_2.df, aes(long, lat, group = group, fill = Count)) +
  geom_polygon() +
  geom_path(color = 'white') +
  coord_equal() +
  facet_grid(Year ~ Hour, drop = F)+
  scale_fill_gradientn(colours=rev(x = rainbow(4)),
                       breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
                       na.value = 'black',
                       space = 'Lab',
                       guide = 'colourbar') +
  labs(x = 'Longitude', y = 'Latitude', 
       title = 'Washington D.C.\nDensity of Assault With a Dangerous Weapon Activity Within Ward 2 From 2011-2015 By Hour and Year', 
       fill = 'Total Activity') +
  theme(panel.background  = element_rect(fill = 'gray20'), 
        panel.border  = element_rect(linetype  = 'solid', fill = NA), 
        panel.spacing  = unit(0.2, 'lines'), 
        strip.text.y = element_text(angle = 0), 
        strip.background  = element_rect(linetype = 'solid', color = 'black'), 
        axis.text  = element_text(color = 'black', size = 5),
        axis.text.x = element_text(angle = 90),
        axis.ticks.length  = unit(0, "cm"))

ggsave('./images/WashDC_WARD_2_adw_plot.png', WashDC_WARD_2_adw_plot, width  = 14, height  = 6)

In the graphic, we can see the variations for each hour for each year. We can see that most of the activity occurs during the 1700 – 0500 hours. The peak of activity is around 0200 and 0300. Over the years, we can see how much the activity has fluctuated. By this view 2015 has some of the highest amounts over the hours. Now we can drill into the more granular details about the incidents to see exactly where the activity is occurring. We will also filter the data to just look at 2015. With smaller spatial groupings we can potentially isolate the problem areas and tag them as areas of interest.
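
A time window such as 1700 to 0500 crosses midnight, so it cannot be tested with a single range comparison. The sketch below shows one way to measure the share of incidents inside such a window. The `hours` vector is made up, and treating the window as ending just before 05:00 is an assumption made for the example.

```r
# Hypothetical incident hours (0-23); the real column is REPORTDATETIME_hour.
hours <- c(0, 2, 3, 3, 9, 14, 17, 21, 23, 12)

# The window 1700-0500 crosses midnight, so it is the union of two ranges.
# Assumption: the window ends just before 05:00.
is_overnight <- hours >= 17 | hours < 5

overnight_share <- mean(is_overnight)
print(overnight_share)
```

If the hour is stored as text such as "00" or "17", converting it with `as.integer()` first keeps the comparison numeric.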

Enhancing Spatial Context

The following will overlay the incidents onto the Ward 2 spatial grouping. This could easily be applied to the whole city. By focusing on one Ward, we can test our hypothesis on a smaller portion of the city. When we determine the best method to analyze and answer our question, we can scale back up.

# Filter the data specific to our interest
Ward.df <- data_2011_2015[data_2011_2015$WARD == 2 & 
                            data_2011_2015$OFFENSE == "ASSAULT W/DANGEROUS WEAPON", ]

# Aggregate the data by location and year
Ward_2_adw <- dcast(data = Ward.df, formula = BLOCKXCOORD + BLOCKYCOORD + REPORTDATETIME_yr ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_2_adw)[3] <- "Year"

# Sort the data by Count
Ward_2_adw <- Ward_2_adw[with(Ward_2_adw, order(Count)),]

#Convert Maryland State Plane coordinates provided in the data
nad83_coords <- data.frame(x=Ward_2_adw$BLOCKXCOORD, y=Ward_2_adw$BLOCKYCOORD)
coordinates(nad83_coords) <- c('x', 'y')
proj4string(nad83_coords)<-CRS("+init=esri:102285")
ConvertedCoords<-spTransform(nad83_coords,CRS("+init=epsg:4326"))
ConvertedCoords<-as.data.frame(ConvertedCoords)
colnames(ConvertedCoords)<-c('long','lat')
Ward_2_adw <- cbind(Ward_2_adw, ConvertedCoords)

# Define the Scale Range (max and min)
maxValue <- max(Ward_2_adw$Count)
minValue <- min(Ward_2_adw$Count)

# Read in the Ward data and filter it to just Ward 2
WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD_2 <- WashDC_WARD[WashDC_WARD$WARD == 2,]
WashDC_WARD_2.points <- fortify(WashDC_WARD_2, region = "id")
WashDC_WARD_2.df <- join(WashDC_WARD_2.points, WashDC_WARD_2@data, by = "id")

Ward2_by_year <- ggplot(WashDC_WARD_2.df, aes(long, lat)) +
  geom_polygon(fill = 'gray', alpha = .75) +
  geom_path(color = 'white') +
  coord_equal() +
  geom_point(data = Ward_2_adw, aes(x = long, y = lat, color = Count)) +
  scale_color_gradientn(colours=rev(x = rainbow(4)),
                        breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
                        na.value = 'black',
                        space = 'Lab',
                        guide = 'colourbar') +
  facet_wrap(facets = ~Year) +
  labs(x = 'Longitude', y = 'Latitude', 
       title = 'Washington D.C.\nDensity of Assault With a Dangerous Weapon Activity Within Ward 2 From 2011-2015 By Year', 
       color = 'Total Activity') +
  theme(panel.background  = element_rect(fill = 'gray20'), 
        panel.border  = element_rect(linetype  = 'solid', fill = NA), 
        panel.spacing  = unit(0.2, 'lines'), 
        strip.text.y = element_text(angle = 0), 
        strip.background  = element_rect(linetype = 'solid', color = 'black'), 
        axis.text  = element_text(color = 'black'),
        axis.ticks.length  = unit(0, "cm"))

ggsave('./images/Ward2_by_year_plot.png', Ward2_by_year, width  = 9, height  = 8)

We can see in the graphic that the activity is concentrated in particular areas, with most of it occurring around the northeastern portion. Now let's overlay all the activity data throughout the city, still filtered to just assault with a dangerous weapon.

# Filter the data specific to our interest
adw.df <- data_2011_2015[data_2011_2015$OFFENSE == "ASSAULT W/DANGEROUS WEAPON", ]

# Aggregate the data by location and year
Ward_adw <- dcast(data = adw.df, formula = WARD + BLOCKXCOORD + BLOCKYCOORD + REPORTDATETIME_yr ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
colnames(Ward_adw)[4] <- "Year"


# Sort the data by Count
Ward_adw <- Ward_adw[with(Ward_adw, order(WARD, Count)),]

#Convert Maryland State Plane coordinates provided in the data
nad83_coords <- data.frame(x=Ward_adw$BLOCKXCOORD, y=Ward_adw$BLOCKYCOORD)
coordinates(nad83_coords) <- c('x', 'y')
proj4string(nad83_coords)<-CRS("+init=esri:102285")
ConvertedCoords<-spTransform(nad83_coords,CRS("+init=epsg:4326"))
ConvertedCoords<-as.data.frame(ConvertedCoords)
colnames(ConvertedCoords)<-c('long','lat')
Ward_adw <- cbind(Ward_adw, ConvertedCoords)

# Define the Scale Range (max and min)
maxValue <- max(Ward_adw$Count)
minValue <- min(Ward_adw$Count)

# Get the years to facet the map by
Ward_adw_years <- data.frame(WARD = unique(Ward_adw$WARD))
Ward_adw_years <- merge(Ward_adw_years, 2011:2015)

# Read in the Ward data; all wards are kept for the city-wide view
WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD.points <- fortify(WashDC_WARD, region = "id")
WashDC_WARD.df <- join(WashDC_WARD.points, WashDC_WARD@data, by = "id")

All_Wards_by_year <- ggplot(WashDC_WARD.df, aes(long, lat)) +
  geom_polygon(fill = 'gray', alpha = .75) +
  geom_path(color = 'white') +
  geom_point(data = Ward_adw, aes(x = long, y = lat, color = Count)) +
  coord_equal() +
  scale_color_gradientn(colours=rev(x = rainbow(4)),
                        breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
                        na.value = 'black',
                        space = 'Lab',
                        guide = 'colourbar') +
  facet_wrap(facets = ~Year) +
  labs(x = 'Longitude', y = 'Latitude', 
       title = 'Washington D.C.\nDensity of Assault With a Dangerous Weapon Activity From 2011-2015 By Year', 
       color = 'Total Activity') +
  theme(panel.background  = element_rect(fill = 'gray20'), 
        panel.border  = element_rect(linetype  = 'solid', fill = NA), 
        panel.spacing  = unit(0.2, 'lines'), 
        strip.text.y = element_text(angle = 0), 
        strip.background  = element_rect(linetype = 'solid', color = 'black'), 
        axis.text  = element_text(color = 'black'),
        axis.ticks.length  = unit(0, "cm"))

ggsave('./images/All_Wards_by_year_plot.png', All_Wards_by_year, width  = 9, height  = 8)

We can assess the concentration of activity over the years. It's also helpful to assess whether areas are improving, getting worse or staying roughly the same.
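
Whether an area is improving or getting worse can also be checked numerically. The following sketch computes the change from the previous year within each ward and labels its direction. The `yearly` values are made up, and reading a rise in reported incidents as "worse" is an assumption made for the example.

```r
# Hypothetical yearly counts for two wards.
yearly <- data.frame(
  WARD  = c(1, 1, 1, 2, 2, 2),
  Year  = c(2013, 2014, 2015, 2013, 2014, 2015),
  Count = c(100, 90, 99, 40, 50, 50)
)

# Order by ward and year so diff() compares consecutive years.
yearly <- yearly[order(yearly$WARD, yearly$Year), ]

# Change from the previous year within each ward (NA for the first year).
yearly$Change <- ave(yearly$Count, yearly$WARD,
                     FUN = function(x) c(NA, diff(x)))

# Label the direction of the change; the first year of each ward stays NA.
yearly$Trend <- ifelse(yearly$Change > 0, "worse",
                ifelse(yearly$Change < 0, "improving", "same"))

print(yearly)
```

The same calculation could be run per block location instead of per ward to follow individual areas of interest.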

Conclusion

There are a multitude of ways we can slice and dice the data and investigate where it leads us. This is where having objectives comes into the picture. Objectives give us boundaries within which to either prove or disprove our hypotheses. When we have accomplished what we set out to do and have discovered follow-on objectives, we have future work to pursue after we finish our initial project. If the new objectives help bolster our project, they can be worth incorporating into the effort. Beware of mission creep, though.

Once our analysis has produced results and we are confident the data appropriately addresses our hypotheses, we can move on to producing visualizations that help convey the evidence for or against our objectives.