So far I have gathered about half of my data sources. Next up are employment and real estate data, per the project plan.
Employment Data
I started by searching for “Washington DC” and “employment data,” then refined the queries a little based on the results I got.
- https://duckduckgo.com/?t=lm&q=washington+dc+employment+data&ia=web
- https://duckduckgo.com/?t=lm&q=Washington+DC+unemployment+data+historical+data&ia=web
DC Department of Employment Services
After combing through several websites in the results, I settled on the DC Department of Employment Services (https://does.dc.gov/page/labor-statistics). On this page I saw a link to unemployment data by ward.
This page was good, except the results and data were all in PDFs. I was not able to find any tabulated data to download, so I got to work structuring my own spreadsheet with the data.
Structuring and Processing Data
I set up four columns: one each for year, month, data, and length. After filling in the years and months from Jan-2009 to Dec-2019, I simply had to copy the data from each month section (seen in the image above). If for some reason you can’t find the data in an easy format, you need to roll up your sleeves and get to work. Over my years of doing this work, it’s a reality: APIs make things much, much easier, but manual work like this also flexes some muscles that sometimes get weak. The length column in my spreadsheet was a check to see whether my data was about the same size each month. Sure enough, it stayed around 254-257 characters.
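That check is easy to reproduce once the spreadsheet is exported. Here is a minimal sketch against the tab-delimited export used later in this post, with its Year, Month, and Data columns; the 250/260 cutoffs are just my loose bounds around the observed range:

library(readr)
library(dplyr)

# Count the characters of each pasted month block; a count far outside
# the observed ~254-257 range flags a month where the copy/paste likely
# dropped or duplicated something.
raw <- read_delim(file = '../data/employment/CY2009.txt', delim = '\t')
raw %>%
  mutate(Length = nchar(Data)) %>%
  filter(Length < 250 | Length > 260)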
Once I had entered the first year of data (one year per PDF), I began writing a script to process the spreadsheet, converting the text values into the tables I would need across wards, years, and months. Preserving all the data was important. Below you can see my process of parsing the data and transforming the structure to get everything into a clean, tabulated form that could be saved for later ingestion into the database.
library(readr)     # read_delim, write_csv
library(stringr)   # str_remove, str_replace_all, str_split
library(dplyr)     # mutate, bind_rows
library(tibble)    # as_tibble
library(magrittr)  # %<>%

new_labels <- c('Ward', 'Labor_Force', 'Employment', 'Unemployment', 'Unemployment_Rate')
sample_data <- read_delim(file = '../data/employment/CY2009.txt', delim = '\t')

data_list <- list()
for (i in 1:nrow(sample_data)) {
  data_to_parse <- sample_data$Data[i]
  data_to_parse %<>%
    # Drop the header text repeated in every month block
    str_remove(pattern = "Ward Labor Force Employment Unemployment Rate ") %>%
    # Strip thousands separators so the values parse as numeric
    str_replace_all(pattern = ',', replacement = '') %>%
    str_split(pattern = ' ') %>%
    unlist() %>%
    as.numeric() %>%
    # Five values per ward: reshape the flat vector into one row per ward
    matrix(ncol = 5, byrow = TRUE) %>%
    data.frame() %>%
    as_tibble() %>%   # as_tibble(), not tibble(), to convert the data frame
    mutate(Year = sample_data$Year[i],
           Month = sample_data$Month[i])
  data_list[[i]] <- data_to_parse
}

employment <- data_list %>% bind_rows()
colnames(employment)[1:5] <- new_labels
write_csv(x = employment, file = '../data/employment/PROCESSED_sample_data.csv')
A simple plot shows the data across the wards of the city. This was also a quick way to check for any anomalies or errors introduced by my copy/paste process across all the PDF documents and the separate month blocks on each.
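Something along the following lines works for that check; a minimal sketch with ggplot2 against the processed CSV, assuming the Month column holds three-letter abbreviations (Jan ... Dec) as entered in the spreadsheet:

library(readr)
library(dplyr)
library(ggplot2)

employment <- read_csv('../data/employment/PROCESSED_sample_data.csv')

# One line per ward; a single wild point usually means a copy/paste slip
# in that month's block.
employment %>%
  mutate(Date = as.Date(paste(Year, Month, '01'), format = '%Y %b %d')) %>%
  ggplot(aes(x = Date, y = Unemployment_Rate, color = factor(Ward))) +
  geom_line() +
  labs(color = 'Ward', y = 'Unemployment rate (%)')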
Department of Labor
The next site I found with interesting, or at least potentially useful, information to analyze alongside other employment data was https://oui.doleta.gov/unemploy/claims.asp. This page provides unemployment insurance weekly claims data. The data is used to analyze unemployment trends at the national and state levels. They describe the information as, “Initial claims measure emerging unemployment and continued weeks claimed measure the number of persons claiming unemployment benefits.” This should be a good complement to my unemployment data by ward, though it doesn’t get to that level of granularity.
A simple plot to quickly show the data.
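A sketch of that plot, with the caveat that the file name and column names below are my own placeholders; the exact layout depends on the report options chosen on the DOL page:

library(readr)
library(ggplot2)

# Placeholder file/column names for the weekly claims export.
claims <- read_csv('../data/employment/dc_weekly_claims.csv')

# read_csv usually parses the week-ended field as a Date automatically;
# if not, convert it with as.Date() before plotting.
ggplot(claims, aes(x = Week_Ended, y = Initial_Claims)) +
  geom_line() +
  labs(x = 'Week ended', y = 'Initial claims')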
US Bureau of Labor Statistics
The next site provides national unemployment data over my time frame, so I grabbed it. I figured I could use it to compare against the data I gathered specifically for Washington DC.
A simple plot to quickly show the downloaded data.
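For the comparison itself, one option is to roll the ward figures up to a labor-force-weighted city rate and overlay it on the national series. The BLS file name and columns below are placeholders; the ward columns come from the processed data above:

library(readr)
library(dplyr)
library(ggplot2)

# Placeholder layout for the BLS export; adjust to the table downloaded.
national <- read_csv('../data/employment/bls_national_unemployment.csv')
employment <- read_csv('../data/employment/PROCESSED_sample_data.csv')

# Labor-force-weighted city rate per month, from the ward-level counts.
dc <- employment %>%
  mutate(Date = as.Date(paste(Year, Month, '01'), format = '%Y %b %d')) %>%
  group_by(Date) %>%
  summarise(Rate = 100 * sum(Unemployment) / sum(Labor_Force),
            .groups = 'drop') %>%
  mutate(Series = 'Washington DC')

national %>%
  transmute(Date, Rate = Unemployment_Rate, Series = 'National') %>%
  bind_rows(dc) %>%
  ggplot(aes(x = Date, y = Rate, color = Series)) +
  geom_line() +
  labs(y = 'Unemployment rate (%)')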
Real Estate Data
This focus area is still being researched and curated. A separate post will detail more information about real estate data and related economic information to incorporate into the analysis.
As a preliminary introduction, since housing value data is not readily accessible right now, I’m looking into building and construction permits. Although permits might not tell me housing value directly, they could point to where construction, renovations, and improvements are happening. This may help me understand where investment and revitalization efforts are concentrated and at what density over time.
I am also looking at various economic data related to households, rentals, and home ownership. The current economic data is not very granular, so it will likely be supplemental to other analyses.
Additional Map Data
While searching for property data, I realized that Metro bus and train station information could be helpful and add context to my analysis. These features can be used to ask additional questions: from this data we could detail the type, frequency, and time distribution of crime in proximity to these fixed locations (a rough sketch of the idea is below). These features have also been updated on my project plan.
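As a rough sketch of that proximity question, assuming hypothetical crimes and stations data frames with Longitude and Latitude columns (the actual extracts are not part of this post), a base-R haversine distance gets the job done:

# Haversine distance in meters between lon/lat points (vectorized over
# the second point set).
haversine <- function(lon1, lat1, lon2, lat2) {
  r <- 6371000            # mean Earth radius in meters
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Hypothetical crimes/stations data frames: for each incident, the
# distance to the nearest station.
nearest_station_m <- sapply(seq_len(nrow(crimes)), function(i) {
  min(haversine(crimes$Longitude[i], crimes$Latitude[i],
                stations$Longitude, stations$Latitude))
})

Incidents could then be bucketed by distance band (say, within 400 m of a station) and profiled by offense type and hour of day.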
Planning Progress
I will refer back to my plan and update it with what we have so far. Using XMind’s icons, I put a task-completion status next to each data item to indicate the level of progress. I have expanded a couple of areas to add some granularity. If during the process I expand or modify my plan, I should make sure the plan reflects those changes. Project documentation is a good skill to have: if you document properly as you go, it will save you time and patience later.
Posts in Project Series
- Criminal Analysis: Planning
- Criminal Analysis: Data Search (part 0)
- Criminal Analysis: Data Search (part 1)
- Criminal Analysis: Data Search (part 2)
- Criminal Analysis: Data Search (part 3)
- Criminal Analysis: Data Storage
- Criminal Analysis: Data Storage (part 2)
- Criminal Analysis: Data Search (part 4)
- Derive a Star Schema By Example
- Criminal Analysis: Data Exploration