Exploring Mapping Data
My next exploration task is the mapping/geospatial tables in my database. I need to assess the what each dataset contains. To aid in the exploration of spatial data, I will demonstrate plotting of spatial points and polygons. This can be a lot easier to looking at this particular structure in tabular or aggregated form. I would like to assess what additional came with each set of data that could be used to further contextualize the city and criminal activity. This post will also need to assess what, if any, needs to be processed or transformed.
Data Preparation
There are several spatial datasets I collected during my data search phase. First, I will look at the polygons sets, then I will explore the points. If you followed the previous post, you will notice that I have added a new function, bulk_table_connections
, in this post to simplify the coding process. The result of the function produces a list of two items, a list of column names and list of table connections.
Assign the table names to a variable
db_tables <- c('charter_school_points',
'dc_polygon',
'metro_bus_stop_points',
'metro_station_points',
'police_district_polygons',
'police_station_points',
'psa_polygons',
'public_school_points',
'summer_public_points',
'transform_school_points',
'ward_polygons',
'zipcode_polygons',
'building_permits',
'construction_permits')
db_tables_polygons <- db_tables[str_detect(string = db_tables, pattern = "polygon")]
db_tables_points <- db_tables[str_detect(string = db_tables, pattern = "points|permits")]
# Get the column names for each table and establish a connection to each data for querying the data
db_polygons <- bulk_table_connections(connection = pg_connect(),
db_tables = db_tables_polygons)
db_points <- bulk_table_connections(connection = pg_connect(),
db_tables = db_tables_points)
Overview of the Polygon Datasets
First I will look at the column names for each of the polygon-related tables. The next post will cover the points-related datasets in continuation of this post.
DC polygon
The dc_polygon
table is small and contains only one polygon feature since it is the covers the entire district. The data associated looks as follows:
The output contains basic information about the spatial feature, mostly area of the polygon (in kilometers and miles), web URL and a reference to the stored geometry in the database. To get the geometry data I can use the following code.
# Get the geometry data from the database
poly_plot_data <- rpostgis::pgGetGeom(conn = pg_connect(),
name = db_tables_polygons[1],
geom = "geometry")
# Convert into sf object
data = st_as_sf(poly_plot_data)
ggplot(data) +
geom_sf() +
geom_sf_label(aes(label = CITY_NAME)) +
labs(x = 'Lon', y = 'Lat')
DC Police Districts
The police_district_polygons
table is little larger with seven polygon features describing each of the Police Districts in the city. The data associated looks as follows:
Again, the output is just basic information about the spatial features, except this doesn’t show the area of each polygon.
Next using the same flow as the previous example we will plot the polygons.
poly_plot_data <-
rpostgis::pgGetGeom(conn = pg_connect(),
name = db_tables_polygons[i_element],
geom = "geometry") %>%
st_as_sf(poly_plot_data)
ggplot(poly_plot_data) +
geom_sf(aes(fill = NAME)) +
geom_sf_label(aes(label = NAME)) +
labs(x = 'Lon', y = 'Lat')
DC Police Service Areas
The psa_polygons
table has 57 polygon features describing each of the Police Service Areas (PSAs) in the city. While Police Districts give more granularity at the city level, the PSAs split up the districts, providing further granularity. This is important to break-down the city criminal activity data so we can compare and contrast among other PSAs as well as provide further activity context. The data associated looks as follows:
Again, the output is just basic information about the spatial features, except this doesn’t show the area of each polygon, again. We can always calculate the area since we are given the polygons if we really needed that level of detail. You could use the area to determine activity per unit area as some metric of comparison.
Next using the same flow as the previous example we will plot the polygons. This time, the color denotes the associated district that the PSA belongs to.
ggplot(poly_plot_data) +
geom_sf(aes(fill = factor(DISTRICT))) +
geom_sf_label(aes(label = NAME)) +
labs(x = 'Lon', y = 'Lat')
DC Wards
The ward_polygons
table has 8 polygon features describing each of the wards in the city. The data associated with the polygons contain a lot of supplemental information, as you can see in the images below of the output. There are 83 columns and most have text datatypes, even though at first glance we might think they should be numeric. The data describes, the population by age, demographics, education. If we recall this data is from 2012 so it is not the most current, it does give us some general idea of the distributions around the city.
After looking through the column values, columns 15 through 78 that contain “PCT_”, “MEDIAN_”, and “_RATE” should be double/numeric values, while the remaining are integer values. With the following code you can also inspect the fields in the R viewer function rather than in the console output, which is constrained.
# double/numeric valued fields
cols_numeric_eval %>%
select(contains(c("MEDIAN_", "_RATE", "PCT_"))) %>%
view()
# int valued fields
cols_numeric_eval %>%
select(!contains(c("MEDIAN_", "_RATE", "PCT_"))) %>%
view()
Next using the same flow as the previous example we will plot the polygons. I could also plot each of the 63 population descriptive columns to provide a visual way to see how each of the wards compare to each other. Below is the code to generate both sets, but I will only display the first.
# Image displayed below
ggplot(poly_plot_data) +
geom_sf(aes(fill = NAME)) +
geom_sf_label(aes(label = NAME)) +
labs(x = 'Lon', y = 'Lat')
# Demonstrated algorithm to generate a plot for each descriptive field
for(i in 15:78){
tmp_col_name <- colnames(poly_plot_data[i])[1]
if(str_detect(string = tmp_col_name, pattern = "((MEDIAN|PCT)_)|(_RATE)")){
poly_plot_data[[i]] %<>% as.numeric()
}else{
poly_plot_data[[i]] %<>% as.integer()
}
ggplot(poly_plot_data) +
geom_sf(aes_string(fill = tmp_col_name)) +
geom_sf_label(aes(label = NAME)) +
labs(x = 'Lon', y = 'Lat')
ggsave(filename = paste0(dir_destination,
'dc_ward_polygons_',
tmp_col_name,'.png'),
width = 6, height = 6,
dpi = "screen",
scale = 1.5)
}
DC Zip Codes
The zipcode_polygons
table has 170 polygon features describing each of the zip codes in the city. This may be too much granularity but there are some interesting analysis you could do with this as well. For the most part you can see that the city has relatively few zipcodes until we get to federally controlled areas of the district (context from living there). The data associated looks as follows:
This much cleaner than the Ward dataset above but still have some useful information about the population.
Since there are so many zip codes, I commented out the label function so they labels would not clutter the graphic.
ggplot(poly_plot_data) +
geom_sf() +
# geom_sf_label(aes(label = NAME)) +
labs(x = 'Lon', y = 'Lat')
Exploring the descriptive fields, we can see the following graphical distributions for uninsured population, MEDICAID recipient population and total population as of 2010 by zip code.
What’s Next?
In the next post I will cover the spatial points datasets mentioned at the beginning of this post.
Posts in Project Series
- Criminal Analysis: Planning
- Criminal Analysis: Data Search (part 0)
- Criminal Analysis: Data Search (part 1)
- Criminal Analysis: Data Search (part 2)
- Criminal Analysis: Data Search (part 3)
- Criminal Analysis: Data Storage
- Criminal Analysis: Data Storage (part 2)
- Criminal Analysis: Data Search (part 4)
- Derive a Star Schema By Example
- Criminal Analysis: Data Exploration
- Criminal Analysis: Data Exploration (part 1)
- Criminal Analysis: Data Exploration (part 2a)
- Criminal Analysis: Data Exploration (part 2b)
- Criminal Analysis: Data Storage (Part 3)