***This was created and valid in 2016. Data for this specific data project may no longer be current***
In this section, we will take a look at pivot tables in R. These can help us quickly summarize the data and get a rough distribution of the values. We will also look at some initial visualizations to help us explore the data. The visualizations in this section are meant to guide our decisions about what to analyze. Some software we could use in this portion includes Tableau, MS Excel, and R. Though not free outside our current academic license, Tableau is a great tool for exploratory analysis and can work through the data pretty quickly.
Pivot Tables In R
Pivot tables are great for exploring the data and getting an understanding of what is going on. In R we can access the dcast function within the reshape2 package, and I will demonstrate how to utilize it. Those familiar with pivot tables in MS Excel will like this portion; for those who have not used a pivot table, this will be valuable going forward. A big difference between Excel and R when it comes to pivot tables is how interactive they are. In Excel, you also don't have to know any code or syntax. R doesn't have that interactiveness, but you can certainly produce a lot more systematically and quickly. Again, it comes down to personal comfort.
The basics of running a pivot table in R:
# Load the reshape2 library to access the dcast function. If you don't want to load the package, you can also access it like this: reshape2::dcast()
library(reshape2)
# Here we show the distinct values for the REPORTDATETIME_yr field we created.
unique(data_2011_2015$REPORTDATETIME_yr)
[1] "2011" "2012" "2013" "2014" "2015"
# Basic Pivot Table
dcast(data = data_2011_2015,
formula = REPORTDATETIME_yr ~ "Count", fun.aggregate = length, value.var = 'OFFENSE')
# REPORTDATETIME_yr Count
# 1 2011 33965
# 2 2012 35439
# 3 2013 35911
# 4 2014 38434
# 5 2015 36561
The dcast function takes the arguments data, formula, fun.aggregate, and value.var. Data is self-explanatory. The formula is structured as rows ~ columns, with the rows and columns separated by a ~. In the example, we set the rows to REPORTDATETIME_yr and the columns to "Count". Notice that "Count" is quoted in the formula; this signifies that it is a text label rather than a column from the data. We are labeling the column "Count", but we could just as easily have labeled it "School House Rocks". The example below will demonstrate using a variable from the data in the columns portion. The fun.aggregate argument indicates how we should aggregate the values.
dcast(data = data_2011_2015,
formula = REPORTDATETIME_yr ~ OFFENSE, fun.aggregate = length, value.var = 'OFFENSE')
# REPORTDATETIME_yr ARSON ASSAULT W/DANGEROUS WEAPON BURGLARY HOMICIDE MOTOR VEHICLE THEFT ROBBERY SEX ABUSE THEFT F/AUTO THEFT/OTHER
# 1 2011 44 2179 3952 108 3375 4163 172 9176 10796
# 2 2012 35 2295 3682 88 2863 4269 258 9469 12480
# 3 2013 35 2393 3357 104 2669 3994 292 10184 12883
# 4 2014 26 2467 3180 105 3121 3269 319 11333 14614
# 5 2015 18 2390 2535 160 2794 3352 332 10973 14007
dcast(data = data_2011_2015,
formula = OFFENSE ~ REPORTDATETIME_yr, fun.aggregate = length, value.var = 'OFFENSE')
# OFFENSE 2011 2012 2013 2014 2015
# 1 ARSON 44 35 35 26 18
# 2 ASSAULT W/DANGEROUS WEAPON 2179 2295 2393 2467 2390
# 3 BURGLARY 3952 3682 3357 3180 2535
# 4 HOMICIDE 108 88 104 105 160
# 5 MOTOR VEHICLE THEFT 3375 2863 2669 3121 2794
# 6 ROBBERY 4163 4269 3994 3269 3352
# 7 SEX ABUSE 172 258 292 319 332
# 8 THEFT F/AUTO 9176 9469 10184 11333 10973
# 9 THEFT/OTHER 10796 12480 12883 14614 14007
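For comparison, the same cross-tabulation can be produced with base R's table() function, with no packages required; dcast simply scales better to more complex formulas and aggregation functions. A minimal sketch, assuming the same data_2011_2015 data frame:

```r
# Base-R analogue of the dcast pivots above: a contingency table of
# offense counts by year.
offense_by_year <- table(data_2011_2015$OFFENSE,
                         data_2011_2015$REPORTDATETIME_yr)
offense_by_year

# as.data.frame.matrix() converts it to a data frame shaped like the
# dcast output if you need one for later plotting or joins.
offense_df <- as.data.frame.matrix(offense_by_year)
```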
If we wanted to get the value distributions for each variable, we could run the following and inspect each element within the list. You may not want to do this for all variables, because the distribution is not important for every variable, such as the unique id number "CNN".
list_pivots <- list()
# List the columns that I want to evaluate. Avoid POSIXlt formatted columns, because they will throw an error.
colnums_to_evaluate <- c(3:5, 10:18, 22:28)
# Direct the output to the variable.
list_pivots <- lapply(X = colnums_to_evaluate,
FUN = function(X) dcast(data = data_2011_2015, formula = data_2011_2015[,X] ~ "Count", fun.aggregate = length, value.var = 'OFFENSE'))
# Name the list elements by their corresponding column names
names(list_pivots) <- colnames(data_2011_2015[ , colnums_to_evaluate])
# Inspect each element that you are interested in.
list_pivots[1]
# $SHIFT
# data_2011_2015[, X] Count
# 1 DAY 70722
# 2 EVENING 76563
# 3 MIDNIGHT 33025
list_pivots[19]
# $REPORTDATETIME_ampm
# data_2011_2015[, X] Count
# 1 AM 66261
# 2 PM 114049
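Rather than indexing the list elements one by one, you can first scan what the loop produced:

```r
# Print the element names so you know which variable distributions
# are available before drilling into individual pivots.
names(list_pivots)

# Double brackets return the pivot itself (a data frame) rather than
# a one-element list, which is handier for further manipulation.
shift_pivot <- list_pivots[["SHIFT"]]
```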
Exploratory Visualizations
During the exploration phase it is often good to visualize the data. Sometimes you may see something unique that you would not get out of a summary pivot table; with a visualization you can quickly understand and interpret the summary results in a different manner. Depending on the type of data, a visualization may prove much more valuable and enable you to make follow-on decisions more quickly.
The following graphics are samples of exploratory visualizations and may not cover the variations or depth that you will experience during your project(s). For each visualization, I will provide the R code to demonstrate how I created it. If you see something that would be useful to you, take the code and modify it to fit your needs. There are all sorts of variations of visualizations, so you have to identify what works best for you. Knowing your data and the type of content it covers is an important part of understanding which types of visualizations are best.
Explore Activity in Time Series
The following gives a good overview of all the activity over the duration of the data. Time series charts are not always the best choice, but one works well for this particular dataset.
#****************************************************************************************
# Time Series Activity
#****************************************************************************************
library(ggplot2)
library(scales)
crimePlot0data <- as.data.frame(dcast(data_2011_2015,as.Date(data_2011_2015$REPORTDATETIME,'%m/%d/%Y') ~ "Count",
fun.aggregate = length, value.var = 'OFFENSE'))
# Rename a column if you choose
colnames(crimePlot0data)[1] <- 'Date'
head(crimePlot0data)
# Date Count
# 1 2011-01-01 71
# 2 2011-01-02 68
# 3 2011-01-03 76
# 4 2011-01-04 90
# 5 2011-01-05 68
# 6 2011-01-06 93
# We can see if the number of offenses is correlated with the date.
cor(y = crimePlot0data$Count,
x = as.numeric(crimePlot0data$Date))
# We can save off a graphic by assigning it to a variable for later use if we need it.
crimePlot0 <-
ggplot(data = crimePlot0data, aes(x = Date,y = Count)) +
geom_line() +
geom_smooth(method = lm, se = FALSE, color = 'red', size = 1) +
scale_x_date(date_breaks = '1 year', date_minor_breaks = "3 months", date_labels = '%Y-%b',
limits = c(as.Date('2011-01-01'), as.Date('2015-12-31'))) +
labs(x = 'Date',
y = 'Number of Incidents',
title = 'Washington D.C.\nCriminal Activity From 2011-2015') +
theme(legend.key = element_rect(fill = 'white'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.margin = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm"))
print(crimePlot0)
In the time series chart, we can see the fluctuations throughout the years, with a general upward trend. There were really low levels of activity during the early months of 2014 and 2015, with elevated activity levels during the late-summer/early-fall of 2014. The red line gives me a quick understanding of the overall trend of activity.
I could dig into this more at the yearly level and compare each year in the follow-on analysis stage. Remember, this is the data exploration stage. We merely want to get an idea about what is going on in our data so we can make follow-on decisions about what we could analyze with more rigor later on.
The exploration may help us modify our initial hypotheses or our project scope if we discover something that will make our analysis better. During each stage we need to reference back to the objectives we developed in the planning stage to make sure we are going in the right direction. It is very easy to get side-tracked.
Overall Activity by Month
Here we look at the overall activity by month. The first thing I do is set the levels for the categorical variable. This ensures that when the values are printed, the order will be as I have defined it here. Once again, I'll save the graphical output, this time to crimePlot1.
#****************************************************************************************
# Overall Activity By Month
#****************************************************************************************
library(ggthemes)
data_2011_2015$REPORTDATETIME_month <- factor(data_2011_2015$REPORTDATETIME_month,
levels = c("January", "February", "March",
"April", "May", "June", "July",
"August", "September", "October",
"November", "December"))
# Initial overview and volume of criminal incidents as reported by the DC Metropolitan Police Department (DC MPD)
crimePlot1<-
qplot(x = data_2011_2015$REPORTDATETIME_month, fill = factor( data_2011_2015$REPORTDATETIME_yr)) +
scale_fill_tableau() +
labs(x = 'Month',
y = 'Number of Incidents',
title = 'Washington D.C.\nOverall Criminal Activity From 2011-2015',
fill = 'Years') +
guides(fill = guide_legend(reverse = T)) +
theme(legend.key = element_rect(fill = 'white'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.margin = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.text.y = element_text(size = 8),
axis.ticks.length = unit(0, "cm"))
print(crimePlot1)
# If you want to save off the graphic use the following:
ggsave(filename = "images/crimePlot1.png", plot = crimePlot1)
# You can tailor each of the arguments as you need. Check out the documentation (??ggsave) for more information about the function.
In the graphic, we can see that the amount of activity is highest during the summer months and relatively lower at the beginning of the year. This may help us later on if we want to pull in weather data to see what temperatures are possibly associated with the levels of activity.
The graphic above might be better suited to another dataset, because here we want to look at how the years compare to each other. For this we create a line graph for each of the years. The code snippet below allows us to depict the data this way.
#****************************************************************************************
# Overall Activity (By Month)
#****************************************************************************************
crimePlot2data <- dcast(data_2011_2015, REPORTDATETIME_month + REPORTDATETIME_yr ~ "Amount",
fun.aggregate = length, value.var = 'OFFENSE')
colnames(crimePlot2data) <- c('Month', 'Year', 'Amount')
crimePlot2 <-
ggplot(crimePlot2data, aes(x = Month, y = Amount, group = Year, color = factor(Year))) +
geom_line(size = 1, stat = 'identity') +
scale_color_tableau() +
labs(x = 'Month',
y = 'Number of Incidents',
title = 'Washington D.C.\nOverall Criminal Activity By Month',
color = 'Years') +
theme(legend.position = 'top',
legend.background = element_rect(color = 'black'),
legend.key = element_rect(fill = 'white'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.margin = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm"))
# Notice that I modify the width and height of the graphic.
ggsave('images/crimePlot2.png', crimePlot2, width = 8, height = 3.5)
This looks much better than the previous graphic and tells us a little more about how each of the months compares across the years. We can easily see that each of the years has a general flow about it that is rather seasonal.
This would also help support the inclusion of weather data to see how the high levels of activity compare to the high temperatures and non-rainy days, and vice versa with the low levels and cooler temperatures or rainy days. Some of the levels of activity might also be associated with various venues and events in the city.
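As a sketch of what that might look like: the daily_weather data frame below is hypothetical (its name and columns are assumptions, not part of this project's data), but the merge pattern is what we would use once a real weather source is found.

```r
# Hypothetical weather table -- daily_weather, HighTemp, and Rain are
# placeholders, not fields from this project's dataset.
daily_weather <- data.frame(Date = as.Date(c('2011-01-01', '2011-01-02')),
                            HighTemp = c(41, 38),
                            Rain = c(FALSE, TRUE))

# merge() joins on the shared Date column; by default only dates present
# in both data frames are kept.
crime_weather <- merge(crimePlot0data, daily_weather, by = 'Date')

# With real weather data in hand, a first check of the relationship could be:
# cor(crime_weather$Count, crime_weather$HighTemp)
```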
How Do Temporal Parameters Relate?
This next one will give us a heatmap as an overview of all activity by month and day of the week. This combination can help us understand if there are specific trends for each of the days across the months. Maybe we will find that certain days have higher or lower levels of activity throughout the year. This might help us ask more questions, develop additional hypotheses, or reevaluate our objectives in light of the way the data unfolds.
#****************************************************************************************
# Activity (By Month and DOW)
#****************************************************************************************
data_2011_2015$REPORTDATETIME_dow <- factor(data_2011_2015$REPORTDATETIME_dow,
levels = c("Sunday", "Monday", "Tuesday",
"Wednesday", "Thursday", "Friday",
"Saturday"))
MonthToDay <- dcast(data_2011_2015, REPORTDATETIME_month ~ REPORTDATETIME_dow,
fun.aggregate = length, value.var = 'OFFENSE')
MonthToDay1 <- melt(data = MonthToDay,
id.vars = "REPORTDATETIME_month",
variable.name = "REPORTDATETIME_dow")
maxValue <- max(MonthToDay1$value)
minValue <- min(MonthToDay1$value)
crimePlot3 <-
ggplot(MonthToDay1, aes(x = REPORTDATETIME_month, y = REPORTDATETIME_dow)) +
geom_tile(aes(fill = value)) +
scale_fill_gradientn(colours = rev(x = rainbow(4)),
breaks = c(minValue, seq(minValue, maxValue, ceiling((maxValue-minValue)/4)), maxValue),
na.value = 'black',
space = 'Lab') +
labs(title = 'Washington D.C.\nCriminal Activity Throughout The Year By Month And Day Of The Week',
y = '', x = '',
fill = 'Frequency') +
theme(panel.background = element_rect(fill = 'white'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.margin = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(size = 8, color = 'black'),
axis.ticks.length = unit(0, "cm"))
ggsave('images/crimePlot3.png', crimePlot3, width = 8, height = 3.5)
In the resultant visualization, we can quickly pick out when Washington D.C. has the most and least activity. There are a few events that also stand out, like Mondays in the month of December. We can also see that Monday through Friday tend to have higher amounts of activity relative to the weekends, but not for every month. Something must also be going on on Wednesdays in July.
Wrap Up Temporal Exploration
I could keep going, creating lots of different graphics with my data without having to actually get down into the analysis. You may spend a lot of time exploring the data in this fashion, modifying the view, swapping out different variables, or even testing out different visualization methods.
Regardless, the key is to explore and move on to the analysis, which is where you will spend more time figuring out why various artifacts discovered in the exploration process are the way they are. We will also investigate how artifacts relate to other variables. In the next section, we will explore Tableau as another means of exploring the data.
Samples In Tableau
Although we have access to Tableau during this course, it may not be available everywhere you go. It is a good tool to know how to use. You should know some of its capabilities and some of its limitations. The cost per license outside of the academic setting may leave you and/or your company falling back to another data visualization software. It is good to know this upfront to help temper your expectations.
Using Tableau as part of your exploration process can be very beneficial. You can easily create multiple graphics with little effort, besides a little drag and drop and some formatting adjustments.
Something that I have noticed with some analysts is that they forget what is happening on the aggregation side. You should ask yourself with each visualization, “Is this what I should be seeing?”. Check your variables, method of aggregation, data type, etc. I have caught myself on several occasions thinking one thing to find out I had a variable listed as a sum when I needed it as a dimension.
Many combinations of errors can occur, but you learn from them. The following will show you the same criminal data, but we will explore some of the other variables and combinations with time and spatial groupings without looking at them spatially (at least for right now).
Explore Offenses over Temporal Parameters
This graphic gives us a view as to how much activity there was by each offense over the three shifts. For each offense there is a Method variable. In this dataset I did not find the method to be very useful. For high level purposes I might not use this, which will affect my consideration on using that variable in the analysis and more in-depth visualization stages. This highlights the good that comes out of the exploration process.
For quick modifications of the chart, I can easily drop one of the other time variables into the rows input section in Tableau to facet my plot, which helps me break down the previous graphic and compare the days of the week for this particular instance. From this view I can see how much theft-related activity stands out as the primary offense type in the dataset and how prevalent it is across all the days and shifts. We can also see that robbery stands out a little more during the midnight shift and that a gun was involved more often.
This next one looks at years instead of days of the week. We get about the same characteristics in this one as we did in the previous.
Let's say we want to look at a specific offense type to get a general overview. We can drop the offense variable into the "page" area, which will allow us to flip through each of the offenses and look at each type in the same graphic without cluttering up the space as much. I replaced offense in the columns area with the Police Service Area (PSA) variable. The spatial grouping variables do not necessarily have to be viewed on a map or analyzed spatially, though that can help develop our understanding of the spatial relationships. Here we can see that certain PSAs experience a particular level of homicide activity over each of the years. We can also see that there is an elevated amount in the 500-700 PSA range.
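If Tableau is not available, a rough ggplot2 stand-in for the offense-by-shift view takes only a few lines. This is a sketch, assuming the data_2011_2015 data frame and the ggplot2 library loaded earlier:

```r
# Rough ggplot2 equivalent of the Tableau offense-by-shift view:
# one bar per offense, faceted by the three shifts.
shiftPlot <- ggplot(data_2011_2015, aes(x = OFFENSE)) +
  geom_bar() +
  facet_wrap(~ SHIFT) +
  coord_flip() +  # flip axes so the long offense labels stay readable
  labs(x = 'Offense', y = 'Number of Incidents',
       title = 'Washington D.C.\nOffenses By Shift')
print(shiftPlot)
```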
Spatial Data
For this particular dataset, we are able to plot the individual data points in space. In order to give the spatial points more meaning, we need some sort of shapefile or API connection to a mapping service (Google Maps, Bing Maps, OpenStreetMap, etc.). The following will serve as a high-level introduction. The coding is not as bad as it looks, and you will see more in the Visualizations stage.
#****************************************************************************************
# MAPS
#****************************************************************************************
library(rgdal)
library(rgeos)
library(maptools)
library(plyr)
WashDC <- readShapeSpatial(fn = "./mapping/DC_Boundary.shp")
str(WashDC)
# Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
# ..@ data :'data.frame': 1 obs. of 9 variables:
# .. ..$ OBJECTID : int 1
# .. ..$ CITY_NAME : Factor w/ 1 level "Washington": 1
# .. ..$ STATE_CITY: int 1150000
# .. ..$ CAPITAL : Factor w/ 1 level "Y": 1
# .. ..$ WEB_URL : Factor w/ 1 level "http://www.dc.gov": 1
# .. ..$ AREAKM : num 177
# .. ..$ AREAMILES : num 68.5
# .. ..$ Shape_Leng: num 67608
# .. ..$ Shape_Area: num 1.77e+08
# .. ..- attr(*, "data_types")= chr [1:9] "N" "C" "N" "C" ...
# ..@ polygons :List of 1
# .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots
# .. .. .. ..@ Polygons :List of 1
# .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
# .. .. .. .. .. .. ..@ labpt : num [1:2] -77 38.9
# .. .. .. .. .. .. ..@ area : num 0.0184
# .. .. .. .. .. .. ..@ hole : logi FALSE
# .. .. .. .. .. .. ..@ ringDir: int 1
# .. .. .. .. .. .. ..@ coords : num [1:12093, 1:2] -77.1 -77.1 -77 -76.9 -77 ...
# .. .. .. ..@ plotOrder: int 1
# .. .. .. ..@ labpt : num [1:2] -77 38.9
# .. .. .. ..@ ID : chr "0"
# .. .. .. ..@ area : num 0.0184
# ..@ plotOrder : int 1
# ..@ bbox : num [1:2, 1:2] -77.1 38.8 -76.9 39
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:2] "x" "y"
# .. .. ..$ : chr [1:2] "min" "max"
# ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
# .. .. ..@ projargs: chr NA
WashDC@data$id <- rownames(WashDC@data)
WashDC.points <- fortify(WashDC)
WashDC.df <- join(WashDC.points, WashDC@data, by = "id")
str(WashDC.df)
# 'data.frame': 12093 obs. of 16 variables:
# $ long : num -77.1 -77.1 -77 -76.9 -77 ...
# $ lat : num 38.9 38.9 39 38.9 38.8 ...
# $ order : int 1 2 3 4 5 6 7 8 9 10 ...
# $ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# $ piece : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
# $ id : chr "0" "0" "0" "0" ...
# $ group : Factor w/ 1 level "0.1": 1 1 1 1 1 1 1 1 1 1 ...
# $ OBJECTID : int 1 1 1 1 1 1 1 1 1 1 ...
# $ CITY_NAME : Factor w/ 1 level "Washington": 1 1 1 1 1 1 1 1 1 1 ...
# $ STATE_CITY: int 1150000 1150000 1150000 1150000 1150000 1150000 1150000 1150000 1150000 1150000 ...
# $ CAPITAL : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
# $ WEB_URL : Factor w/ 1 level "http://www.dc.gov": 1 1 1 1 1 1 1 1 1 1 ...
# $ AREAKM : num 177 177 177 177 177 ...
# $ AREAMILES : num 68.5 68.5 68.5 68.5 68.5 ...
# $ Shape_Leng: num 67608 67608 67608 67608 67608 ...
# $ Shape_Area: num 1.77e+08 1.77e+08 1.77e+08 1.77e+08 1.77e+08 ...
WashDC_plot <- ggplot(WashDC.df, aes(long, lat, group = group)) +
geom_polygon() +
geom_path(color = 'white') +
coord_equal() +
labs(title = 'Washington D.C.', x = 'Longitude', y = 'Latitude')
ggsave('./images/WashDCBoundary.png', WashDC_plot)
What the R code did was read in the boundary shapefile of Washington D.C. that I downloaded. Then I transformed it and defined the structure. After that, the plot portion is much easier. I will go into this in more depth at a later date. The code above produces the following graphic. Now that we can read in shapefiles and plot them in R, we can move into more high-level exploration.
Ward Level Activity
This next graphic will plot the criminal activity at the Ward level. It serves as an example of our ability to survey the data at the spatial grouping levels. Most datasets will not need this, nor will they have it available without some additional data enhancement and joining with other datasets, but it is an option. This type of exploration helps you understand how activity is distributed throughout Washington D.C. We can apply the same method to each of the spatial grouping variables.
#****************************************************************************************
# Ward
#****************************************************************************************
# Plot the Ward level with data
library(reshape2)
Ward.df <- as.data.frame(dcast(data_2011_2015, WARD ~ 'Count', fun.aggregate = length, value.var = 'OFFENSE', drop = T))
# WARD Count
# 1 1 26047
# 2 2 32218
# 3 3 9388
# 4 4 16506
# 5 5 24301
# 6 6 28120
# 7 7 22505
# 8 8 21216
# 9 NA 9
Ward.df <- na.omit(Ward.df) # This removes the 9 records that did not have the Ward field filled out.
maxValue <- max(Ward.df$Count)
minValue <- min(Ward.df$Count)
WashDC_WARD <- readOGR(dsn = "./mapping/.", layer = "Ward_-_2012")
WashDC_WARD@data$id <- rownames(WashDC_WARD@data)
WashDC_WARD@data <- join(WashDC_WARD@data, Ward.df, by = "WARD")
WashDC_WARD.points <- fortify(WashDC_WARD, region = "id")
WashDC_WARD.df <- join(WashDC_WARD.points, WashDC_WARD@data, by = "id")
str(WashDC_WARD.df)
# 'data.frame': 22755 obs. of 21 variables:
# $ long : num -77 -77 -77 -77 -77 ...
# $ lat : num 38.9 38.9 38.9 38.9 38.9 ...
# $ order : int 1 2 3 4 5 6 7 8 9 10 ...
# $ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# $ piece : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
# $ id : chr "0" "0" "0" "0" ...
# $ group : Factor w/ 8 levels "0.1","1.1","2.1",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ OBJECTID : int 1 1 1 1 1 1 1 1 1 1 ...
# $ WARD : int 8 8 8 8 8 8 8 8 8 8 ...
# $ NAME : Factor w/ 8 levels "Ward 1","Ward 2",..: 8 8 8 8 8 8 8 8 8 8 ...
# $ REP_NAME : Factor w/ 7 levels "Brianne Nadeau",..: 6 6 6 6 6 6 6 6 6 6 ...
# $ WEB_URL : Factor w/ 7 levels "http://dccouncil.us/council/brianne-nadeau",..: 7 7 7 7 7 7 7 7 7 7 ...
# $ REP_PHONE : Factor w/ 7 levels "(202) 724-8028",..: 7 7 7 7 7 7 7 7 7 7 ...
# $ REP_EMAIL : Factor w/ 7 levels "[email protected]",..: 6 6 6 6 6 6 6 6 6 6 ...
# $ REP_OFFICE: Factor w/ 8 levels "1350 Pennsylvania Ave, Suite 102, NW 20004",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ WARD_ID : Factor w/ 8 levels "1","2","3","4",..: 8 8 8 8 8 8 8 8 8 8 ...
# $ LABEL : Factor w/ 8 levels "Ward 1","Ward 2",..: 8 8 8 8 8 8 8 8 8 8 ...
# $ AREASQMI : num 11.9 11.9 11.9 11.9 11.9 ...
# $ Shape_Leng: num 28714 28714 28714 28714 28714 ...
# $ Shape_Area: num 3.1e+07 3.1e+07 3.1e+07 3.1e+07 3.1e+07 ...
# $ Count : int 21216 21216 21216 21216 21216 21216 21216 21216 21216 21216 ...
# This will be to just put in the Ward Numbers on the map for simple reference
CoGdf <- data.frame(WARD = 1:8,
                    long = c(-77.03033, -77.045, -77.08, -77.03645,
                             -76.98, -76.99, -76.94, -77),
                    lat = c(38.92642, 38.89, 38.93, 38.97,
                            38.93, 38.89, 38.89, 38.83922))
WashDC_WARD_plot <- ggplot(WashDC_WARD.df, aes(long, lat, group = group, fill = Count)) +
geom_polygon() +
geom_path(color = 'white') +
coord_equal() +
scale_fill_gradientn(colours=rev(x = rainbow(4)),
breaks= c(0,seq(0,maxValue,ceiling(maxValue/5)),maxValue),
na.value = 'black',
space = 'Lab',
guide = 'colourbar') +
annotate("text",
x = CoGdf$long,
y = CoGdf$lat,
label = CoGdf$WARD,
size = 5) +
labs(x = 'Longitude', y = 'Latitude',
title = 'Washington D.C.\nDensity of Criminal Activity From 2011-2015 By Ward',
fill = 'Total Activity') +
theme(panel.background = element_rect(fill = 'gray20'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.margin = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm"))
ggsave('./images/WashDC_WARD_plot.png', WashDC_WARD_plot, width = 8, height = 8)
As we can see, most of the criminal activity is in a particular area. In the follow-on analysis, we could dive into what types of offenses occur there, as well as when the activity occurs. This is just a start in the exploration.
Conclusion
You should be able to see the different methods you can use to explore the data. We did not go into the analysis, but merely gained a general understanding of what was and/or was not available.
The exploration stage will help your understanding of the data going into the analysis. It gives you a background of the data and what to begin testing out and analyzing. Your exploration may take up some time. Truth be told, some of what you gain from it is part of the analysis of the dataset.
After completing the exploration stage, you should be able to modify or clarify your project objectives. This may temper your ambitions of what you can and cannot do with the data. You may also find that you create additional hypotheses to test and experiment with in the analysis.