Exploring Criminal Activity Data
My primary data set to explore is the crime table in my database. I need to assess the values and consistency of each of the fields provided by the source. The data may be somewhat inconsistent across the past decade of records, especially if the source providers have not tidied it up, so I will also need to determine what needs to be processed or transformed. This process will likely lead to spatial and temporal reference tables in my database to standardize values across all my data sets.
Project Planning
Now we will focus on the “Questions” portion of the project plan. I recorded several questions at the start of the project. Throughout this data exploration phase, I will reflect on those original questions while adding more as I get a better idea of the various distributions. The exploration process may also lead me to modify some of the original questions.
Metadata Review
First, I wanted to get the metadata from the source to understand the data contained in each field. I wrote an R script to pull the metadata from the 2015 metadata page (here) and structured the information into a CSV.
This review of the metadata provides context for each abbreviated field label: where the field's definition is derived from and a brief description. When I look at the three date fields (“REPORT_DAT”, “START_DATE” and “END_DATE”), I can understand what each one specifically represents, even if some fields are more self-explanatory than others.
For my analysis, I should be using the “START_DATE” field. I could possibly assess the duration of a crime from the start and end dates of the documented crime, though I am skeptical of how valuable or accurate that calculation would be. The report date describes when the report was made, which may differ from when the actual event occurred.
We also get a URL reference for more information on the criminal offense definitions used. The Shift field description tells us which hours of the day each shift covers.
Looking at the coordinate fields (latitude/longitude and XBLOCK/YBLOCK), the data uses two different coordinate systems. For this project, I will only be interested in the latitude and longitude fields in decimal degrees. I think I can ignore the BLOCK and BLOCK GROUP fields as well, though I'll keep them in mind.
For more information, you can reference my script and CSV on GitHub.
Initial Overview and Distribution of Data
Now we will look at the distribution of each column and assess its formatting.
Queries and Aggregates
In R, I will pull the data from the project_crime database. When you follow the script and perform the same steps, you'll notice that the data is not in memory the way it would be if you read CSVs into your environment. Your R environment will show only 51.5 KB of data in the assigned variable. This is because we established a table connection to the Postgres database and didn't actually move/collect the data into R. For more information on the pg_connect() function seen in the code below, please reference my project_functions.R script on GitHub.
# Using the dbplyr library to interact with the database
# This retrieves our table from the database
crime <- tbl(pg_connect(), db_table)
Issue with Running Parameterized Queries
After trying to use the dplyr/dbplyr grammar to group_by and summarise using parameters, I have defaulted to writing parameterized SQL statements and passing them to the database to fetch the results. When I figure out how to parameterize and still get the appropriate results, I'll update my R scripts. If you are not using parameters, continue using the crime variable in conjunction with the dplyr/dbplyr grammar.
To fetch the results for in-memory operations, add %>% collect() at the end of your piped commands. If you don't, the result remains a lazy table reference to the database rather than data in R.
See the “Non-Parameterized Query Alternative” section at the bottom for examples of that style of code.
In the first command, I filter my list of column names to remove the columns I don't want to evaluate yet; I will return to those fields in later sections. Next, I construct parameterized SQL statements to fetch aggregate tables from the database. This pushes the work to the database rather than processing and duplicating all the data in memory.
query_parameters <- table_col_names[
!(table_col_names %in% c('XBLOCK', 'YBLOCK',
'START_DATE', 'END_DATE',
'REPORT_DAT',
'OCTO_RECORD_ID', 'CCN',
'LATITUDE', 'LONGITUDE'))
]
crime_table_summaries <-
lapply(X = query_parameters,
FUN = function(X){
query_string <- paste0('SELECT "', X,'", ',
'COUNT(*) as CNT ',
'FROM ', db_table, ' ',
'GROUP BY "', X,'";')
# print(query_string)
RPostgres::dbGetQuery(conn = pg_connect(),
statement = query_string)
})
Here is a sample of one of the tables in the output list — in this instance, an aggregate of events by WARD.
With the summary tables in my crime_table_summaries variable, I will look at the number of unique values per field before deciding whether to plot the distribution of every field. The results (imaged below) tell me I should probably limit my charts to fields with fewer than 100 unique values, so we don't overcrowd the x-axis. We can always review and inspect the tables without visualizing them. You will notice that the OBJECTID field has 405086 unique values. This matches our total record count and gives some perspective when looking at the distinct value counts of the other fields.
column_sizes <-
sapply(1:length(crime_table_summaries),
function(x) nrow(crime_table_summaries[[x]])) %>%
tibble()
column_sizes %<>% mutate(names = query_parameters)
Value Distributions
Now we can produce simple data visualizations to see how the number of criminal activity records is distributed across the unique values of each field (those with fewer than 100). The image gallery below the code section shows what each qualifying field looks like.
plot_list <-
lapply(1:length(query_parameters),
function(x) {
tmp_df <-
crime_table_summaries[[x]] %>% tibble
if (nrow(tmp_df) < 100){
return(
ggplot(data = tmp_df,
aes(y = as.numeric(cnt))) +
geom_point(aes_string(x = colnames(tmp_df)[1])) +
theme(axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(y = "Count") +
ggtitle(label = paste0('Criminal Activity by ',
colnames(tmp_df)[1])))
}else{
return("Plot axis too large (>100)")
}
})
Without much rigor, I can see where crime is occurring more frequently for the spatial-related grouping fields, what types of crime are present and when that crime takes place by shift. These charts can also show where I have NA (missing) values. The visuals are crude because I'm only looking for a high-level overview and am not interested in polishing them or getting into the weeds at this point.
I do want to take another look at the Business Improvement District plot, except this time I'm going to remove the NA values, since the volume of NAs was so large it dwarfed the counts for the other values.
For more information about the exact script used, please refer to the GitHub entry for this post.
Report ID and CCN Overview
Now I want to review the record identifiers to see if there are any duplicates or events that have multiple crimes associated. First, I will perform the same process as before to aggregate the data by field. Then, I will assess the number of unique records.
query_record_parameters <- c('OBJECTID', 'OCTO_RECORD_ID', 'CCN')
crime_table_summaries_admin <-
lapply(X = query_record_parameters,
FUN = function(X){
query_string <- paste0('SELECT "', X,
'", COUNT(*) as CNT ',
'FROM ', db_table, ' ',
'GROUP BY "', X,'";')
RPostgres::dbGetQuery(conn = pg_connect(),
statement = query_string)
})
column_sizes_admin <- sapply(1:length(crime_table_summaries_admin), function(x) nrow(crime_table_summaries_admin[[x]])) %>%
tibble() %>%
mutate(names = query_record_parameters)
The code produced the following:
We can see that “OBJECTID” and “OCTO_RECORD_ID” have distinct counts equal to the total number of records. The “CCN” field does not, so I need to investigate what is going on.
Below is the code to filter the “CCN” summary table down to the CCNs associated with multiple records. In the table (image), there is one CCN related to 12 incidents, one with four, six with three, and the remaining 96 with two. These records will need closer inspection to evaluate why there are multiple records per Criminal Complaint Number (CCN).
ccn_dups <-
crime_table_summaries_admin[[3]] %>%
tibble %>%
filter(cnt > 1) %>%
arrange(desc(cnt))
CCN Inspection with Multiple Records
First, let's see what the “OCTO_RECORD_ID” values are for CCN = 13132784.
query_ccn <- paste0('SELECT "OCTO_RECORD_ID" ',
'FROM crime ',
'WHERE \'13132784\' = "CCN";')
recid_ccn_13132784 <-
RPostgres::dbGetQuery(conn = pg_connect(),
statement = query_ccn)
Rather than submit query after query to the database, I'll just pull all the fields for the desired CCN. In the code below, I select everything, drop the two fields I know will be unique, and then take the distinct rows from what remains.
query_ccn_all <- paste0('SELECT * ',
'FROM crime ',
'WHERE \'13132784\' = "CCN";')
recid_ccn_13132784_all <-
  RPostgres::dbGetQuery(conn = pg_connect(),
                        statement = query_ccn_all)
# Check the distinct values minus the two fields with known uniqueness
recid_ccn_13132784_all %>%
select(-c("OBJECTID", "OCTO_RECORD_ID")) %>%
distinct
The resulting table tells us that we really have one distinct event. To do due diligence, I decided to do some research to see if there was an event with 12 homicides. The data provides all the information needed to formulate a search query, especially the time and location.
RESEARCH
After a quick search (“Washington+DC+Homicides+2013+September+16”), I found the following event, which validates the data.
After seeing the result, I remembered the event; I was working in the Washington DC area at the time. Given the result, I could probably research each event. Instead, I'll do a quick check on what the offense(s) were for each of the duplicated CCNs.
Other Duplicate CCN Events
I created a view in my database to hold the CCNs with more than one associated record, and joined it with the crime table. The resulting table lists the CCNs with their associated offenses and record counts.
query_ccn_dups <-
paste0('SELECT ',
'"CCN" as ccn, ',
'"OFFENSE" as offense, ',
'COUNT(*) as cnt ',
'FROM ', db_table, ' ',
'inner join (select * from view_gt2_recid_per_ccn) as tb2 ',
'on "CCN" = ccn ',
'GROUP BY "CCN", offense ',
'ORDER BY cnt desc;')
ccn_dups_all <-
RPostgres::dbGetQuery(conn = pg_connect(),
statement = query_ccn_dups)
The images below show the results. You can see in the second image that a couple of CCNs are tied to multiple offenses.
Without digging into each event, I will have to assume that multiple offenses, whether they are the same offense or not, are unique crimes and should be counted as such.
Date-Time Overview
Finally, I will review the date fields to see if I need to split out or reformat the date/time structures. As a side note, the date/time fields are stored in the database as text; at a minimum, this should be changed to simplify querying.
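For illustration, converting the columns in place might look something like the sketch below. I have not run this as part of this post (the queries later on cast with to_date() instead), so treat it as an assumption about how the conversion could be done, reusing the db_table and pg_connect() helpers from above.
# Sketch only: convert the text date columns to proper timestamps in Postgres
# The dbExecute call is left commented out on purpose
date_cols <- c('START_DATE', 'END_DATE', 'REPORT_DAT')
alter_statements <-
  paste0('ALTER TABLE ', db_table,
         ' ALTER COLUMN "', date_cols, '" TYPE timestamp ',
         'USING to_timestamp("', date_cols, '", \'YYYY/MM/DD HH24:MI:SS\');')
# lapply(alter_statements, function(s)
#   RPostgres::dbExecute(conn = pg_connect(), statement = s))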
query_date_parameters <- c('START_DATE', 'END_DATE', 'REPORT_DAT')
crime_table_summaries_dates <-
lapply(X = query_date_parameters,
FUN = function(X){
query_string <- paste0('SELECT "', X,'", ',
'COUNT(*) as CNT ',
'FROM ', db_table, ' ',
'GROUP BY "', X,'";')
# print(query_string)
RPostgres::dbGetQuery(conn = pg_connect(),
statement = query_string)
})
column_sizes_dates <- sapply(1:length(crime_table_summaries_dates), function(x) nrow(crime_table_summaries_dates[[x]])) %>%
tibble() %>%
mutate(names = query_date_parameters)
From the table we can see that there are many duplicate values in each column. Of the 405086 records in the crime table, there are 404964 unique CCNs. We can compute the percentages of uniqueness.
column_sizes_dates %>%
mutate(percent_total = column_sizes_dates$. / 405086,
percent_unique_ccn = column_sizes_dates$. / 404964)
Start Date Profile
In the image below we can see the start dates ordered by count (start_date_profile here is simply the first of the date summary tables, arranged in descending order of count; the code is shown at the bottom of this post). There were 20 events that occurred on 2015-08-23 at 8PM. The times are likely rounded, so we should not expect every event to carry a precise time. What I can do is drill down only as far as the hour value for any analysis, which should be sufficient to look at activity by hour of day.
In the view we can also see, on line 6, that there are 10 events dated in 1970. This is an obvious placeholder error: a value of zero corresponds to the Unix epoch (1970-01-01 00:00:00). I will aggregate on year to see what other oddities there are in the data. Before I do that, I'll convert the date text into a timestamp value and extract at least the year, month, day and hour, then perform aggregates on each extracted field. I will also check whether the date format is consistent for all records.
Check Format
To check the formatting, I will detect whether any date contains a “-”, which would break a single specified format. At first glance, the format looks to be a standard “%Y/%m/%d %H:%M:%S”.
Running the following code returns FALSE for each check. This tells me there are no NA or NULL values and no obvious alternate format, which simplifies the next step of parsing and extracting.
start_date_profile %>% select(1) %>%
pull %>% str_detect("-") %>% any()
start_date_profile %>% select(1) %>%
pull %>% is.na() %>% any()
start_date_profile %>% select(1) %>%
pull %>% is.null() %>% any()
Parse and Extract
In this section, I'll parse the date-times and extract information using the lubridate package (link to official documentation). Lubridate is part of the Tidyverse ecosystem and makes date/time operations far less cumbersome than they used to be. The function names map almost one-to-one to the components I want to extract. For month and weekday, I added label = TRUE to display the three-character name values.
start_date_profile %<>%
mutate(datetime = lubridate::ymd_hms(START_DATE),
date = lubridate::date(START_DATE),
year = lubridate::year(START_DATE),
month = lubridate::month(START_DATE, label = T),
day = lubridate::day(START_DATE),
weekday = lubridate::wday(START_DATE, label = T),
weeknum = lubridate::week(START_DATE),
quarter = lubridate::quarter(START_DATE)) %>%
mutate(hour = datetime %>% lubridate::hour())
Also notice that I used a second mutate() in the pipe to derive the hour from the datetime value I had just created. (dplyr actually lets you reference a column created earlier within the same mutate() call, so this could be collapsed into a single mutate(); I kept them separate for clarity.)
Aggregates
Years
First, I will look at years. Ideally, I should only see 2009-2020. As we can see in the images below, I have 23 distinct years; the years outside the project timeframe account for 363 distinct dates out of the overall 340844 (~0.1%). This is a small fraction, but I should still remove those records from my analysis.
# Yearly aggregates before the project timeframe
start_date_profile %>%
filter(year < 2009) %>%
group_by(year) %>%
arrange(year) %>%
summarise(cnt = n()) %>%
view()
# Total number of items outside the project timeframe
start_date_profile %>%
filter(year < 2009) %>%
nrow()
# Yearly aggregates during the project timeframe
start_date_profile %>%
filter(year >= 2009) %>%
group_by(year) %>%
arrange(year) %>%
summarise(cnt = n()) %>%
view()
Months
Next, I will look at months. From this point forward I will filter out all records before 2009. Since month values are strictly bounded (1-12), we get the expected set of values.
start_date_profile_2009_2020 <-
start_date_profile %>%
filter(year >= 2009)
start_date_profile_2009_2020 %>%
group_by(month) %>%
summarise(cnt = n()) %>%
qplot(data = ., x = month, y = cnt)
Days
Next, I will look at days of the month and days of the week. These are bounded to 1-31 and Mon-Sun respectively. As we can see in the image below, there is much less activity on the 31st of the month, which makes sense since only seven months have a 31st day. At a high level, activity increases slightly from the beginning to the end of a typical month. In the day-of-week graphic, activity is noticeably higher on Friday and Saturday. We will characterize all of this in more detail in future analysis posts. The charts were produced along the lines of the sketch below.
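A minimal sketch of the aggregates behind these charts, following the same pattern as the month plot above (like that plot, it counts distinct start date-times per bucket):
# Day-of-month and day-of-week aggregates, same pattern as the month plot
start_date_profile_2009_2020 %>%
  group_by(day) %>%
  summarise(cnt = n()) %>%
  qplot(data = ., x = day, y = cnt)
start_date_profile_2009_2020 %>%
  group_by(weekday) %>%
  summarise(cnt = n()) %>%
  qplot(data = ., x = weekday, y = cnt)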
Hours
Finally, I will look at hours, which should only range from 00 to 23. As we can see in the image below, activity hits a low in the 5th hour of the day (5AM) and peaks in the 18th (6PM).
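The hourly chart follows the same pattern as the sketches above:
# Hourly aggregate, same pattern as above
start_date_profile_2009_2020 %>%
  group_by(hour) %>%
  summarise(cnt = n()) %>%
  qplot(data = ., x = hour, y = cnt)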
End Date Profile
Since the end date should bookend the start date, I will keep this section brief; I process the field the same way using the following code:
end_date_profile <-
crime_table_summaries_dates[[2]] %>%
arrange(desc(cnt)) %>%
tibble
# Check formatting
end_date_profile %>% select(1) %>%
pull %>% str_detect("-") %>% any()
end_date_profile %>% select(1) %>%
pull %>% is.na() %>% any()
end_date_profile %>% select(1) %>%
pull %>% is.null() %>% any()
# Parse and extract date/time information
end_date_profile %<>%
mutate(datetime = lubridate::ymd_hms(END_DATE),
date = lubridate::date(END_DATE),
year = lubridate::year(END_DATE),
month = lubridate::month(END_DATE, label = T),
day = lubridate::day(END_DATE),
weekday = lubridate::wday(END_DATE, label = T),
weeknum = lubridate::week(END_DATE),
quarter = lubridate::quarter(END_DATE)) %>%
mutate(hour = datetime %>% lubridate::hour())
While inspecting the values, I thought it would be worth showing the strange end dates present. Some of them look like actual typos, where an adjacent key was pressed or digits were transposed; I have run into both cases in many datasets I have processed for work. The volume of bad dates is minimal, so we can still do duration calculations later on.
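As a rough illustration of the duration calculation I have in mind, something like the sketch below could work: pull both date fields per record from the crime table connection, parse them, and drop missing or negative durations (the typo cases mentioned above). Treat this as a sketch; I have not validated it against the full table yet.
# Sketch: per-record duration between start and end dates, ignoring bad values
durations <-
  crime %>%
  select(START_DATE, END_DATE) %>%
  collect() %>%
  mutate(start = lubridate::ymd_hms(START_DATE),
         end = lubridate::ymd_hms(END_DATE),
         duration_hrs = as.numeric(difftime(end, start, units = "hours"))) %>%
  filter(!is.na(duration_hrs), duration_hrs >= 0)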
Report Date Profile
The report date is somewhat independent of the other date fields: it describes the date the offense was reported to the Metropolitan Police Department (MPD). I will process the field the same way using the following code:
report_date_profile <-
crime_table_summaries_dates[[3]] %>%
arrange(desc(cnt)) %>%
tibble
# Check formatting
report_date_profile %>% select(1) %>%
pull %>% str_detect("-") %>% any()
report_date_profile %>% select(1) %>%
pull %>% is.na() %>% any()
report_date_profile %>% select(1) %>%
pull %>% is.null() %>% any()
# Parse and extract date/time information
report_date_profile %<>%
mutate(datetime = lubridate::ymd_hms(REPORT_DAT),
date = lubridate::date(REPORT_DAT),
year = lubridate::year(REPORT_DAT),
month = lubridate::month(REPORT_DAT, label = T),
day = lubridate::day(REPORT_DAT),
weekday = lubridate::wday(REPORT_DAT, label = T),
weeknum = lubridate::week(REPORT_DAT),
quarter = lubridate::quarter(REPORT_DAT)) %>%
mutate(hour = datetime %>% lubridate::hour())
Load Date Tables into Database
Now that I have processed all the date fields, I will push the results back into the database. Before loading the data, I'll tie up some loose ends by dropping the count field, sorting each table by date and creating a unique identifier.
start_date_profile %<>%
select(-cnt) %>%
arrange(date) %>%
rowid_to_column(var = "date_id")
end_date_profile %<>%
select(-cnt) %>%
arrange(date) %>%
rowid_to_column(var = "date_id")
report_date_profile %<>%
select(-cnt) %>%
arrange(date) %>%
rowid_to_column(var = "date_id")
Now I’m loading the tables into the database.
dbWriteTable(conn = pg_connect(),
name = "crime_start_date",
value = start_date_profile %>% as.data.frame(.))
dbWriteTable(conn = pg_connect(),
name = "crime_end_date",
value = end_date_profile %>% as.data.frame(.))
dbWriteTable(conn = pg_connect(),
name = "crime_report_date",
value = report_date_profile %>% as.data.frame(.))
Comparison of Start Date and Report Date
In this snippet, I will look at the number of records by year for the “START_DATE” and “REPORT_DAT” fields.
query_string <-
paste0(
'select ',
'date_part(\'year\', to_date("START_DATE", \'YYYY/MM/DD\')) as sd_yr, ',
'date_part(\'year\', to_date("REPORT_DAT", \'YYYY/MM/DD\')) as rd_yr, ',
'count(*) as cnt ',
'from crime ',
'group by sd_yr, rd_yr ',
'order by 1, 2;'
)
comparison_results <-
RPostgres::dbGetQuery(conn = pg_connect(),
statement = query_string)
comparison_results %>%
spread(data = ., key = rd_yr, value = cnt, fill = 0)
The final bit of code produces a crosstab (pivot table) of the two fields, as depicted in the image below. Report date years run across the top of the table (column headers) and start date years run down the side (row headers).
We can see that a small portion of offenses occurred in the year before they were reported. These could be events at the end of a year that were not reported until the following year, or events with a longer reporting delay; as the gap in years grows, the number of reported incidents drops off quickly. What I find odd is that some events have a start date after their report date, i.e. they appear to be reported before they occurred. This is likely another data entry error, and the number of such incidents is quite small. As a result, I will adjust my analysis filters to exclude records where the start date occurs after the report date. This is just supplemental to our understanding of the data.
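For completeness, a quick way to check that filter on the database side, reusing the to_date() casts from the query above (a sketch, not part of the main pipeline yet):
# Sketch: count the records where the start date falls after the report date
query_bad_order <-
  paste0('SELECT COUNT(*) AS cnt ',
         'FROM ', db_table, ' ',
         'WHERE to_date("START_DATE", \'YYYY/MM/DD\') > ',
         'to_date("REPORT_DAT", \'YYYY/MM/DD\');')
RPostgres::dbGetQuery(conn = pg_connect(),
                      statement = query_bad_order)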
Non-Parameterized Query Alternative
Using the date section as an example, the following code is what you would use if you wanted to skip parameterization and lapply and instead query the database for the aggregated results with the dplyr/dbplyr grammar. Both methods work, and the size of the data stored in the variable is the same. The dplyr/dbplyr grammar is certainly much easier than writing out SQL strings, but it becomes repetitive if you are aggregating or processing data over several variables. For one-off operations it would be the preferred method.
start_date_profile <-
crime %>%
group_by(START_DATE) %>%
summarise(CNT = n()) %>%
collect()
end_date_profile <-
crime %>%
group_by(END_DATE) %>%
summarise(CNT = n()) %>%
collect()
report_date_profile <-
crime %>%
group_by(REPORT_DAT) %>%
summarise(CNT = n()) %>%
collect()
As noted earlier, add %>% collect() at the end of your piped commands to fetch the results for in-memory operations; otherwise the result remains a lazy table reference to the database. See the following code and image for a comparison:
start_date_profile_no_collect <-
crime %>%
group_by(START_DATE) %>%
summarise(CNT = n())
start_date_profile_collect <-
crime %>%
group_by(START_DATE) %>%
summarise(CNT = n()) %>%
collect()
For reference, this is how start_date_profile (used throughout the “Start Date Profile” section above) is built from the parameterized results:
start_date_profile <-
crime_table_summaries_dates[[1]] %>%
arrange(desc(cnt)) %>%
tibble
Posts in Project Series
- Criminal Analysis: Planning
- Criminal Analysis: Data Search (part 0)
- Criminal Analysis: Data Search (part 1)
- Criminal Analysis: Data Search (part 2)
- Criminal Analysis: Data Search (part 3)
- Criminal Analysis: Data Storage
- Criminal Analysis: Data Storage (part 2)
- Criminal Analysis: Data Search (part 4)
- Derive a Star Schema By Example
- Criminal Analysis: Data Exploration
- Criminal Analysis: Data Exploration (part 1)
- Criminal Analysis: Data Exploration (part 2a)
- Criminal Analysis: Data Exploration (part 2b)
- Criminal Analysis: Data Storage (Part 3)