Exploring Mapping Data
In a continuation from part 2a, this post will explore the spatial points datasets from my database. I need to assess what each dataset contains. To aid in the exploration of spatial data, I will demonstrate plotting spatial points. This can be a lot easier to look at this particular structure in tabular or aggregated form. I would like to assess what additionally came with each set of data that could be used to further contextualize the city and criminal activity. This post will also assess what, if any, needs to be processed or transformed.
Data Preparation
This section will be a repeat from my Part 2a post, just in case anyone skipped that.
There are several spatial datasets I collected during my data search phase. First, I will look at the polygons sets, then(now) I will explore the points. If you followed the previous post, you will notice that I have added a new function, bulk_table_connections
, in this post to simplify the coding process. The result of the function produces a list of two items, a list of column names and list of table connections.
# Assign the table names to a variable
db_tables <- c('police_station_points',
'public_school_points',
'charter_school_points',
'summer_public_points',
'transform_school_points',
'metro_bus_stop_points',
'metro_station_points',
'building_permits',
'construction_permits')
db_tables_points <- db_tables[str_detect(string = db_tables, pattern = "points|permits")]
# Get the column names for each table and establish a connection to each data for querying the data
db_points <- bulk_table_connections(connection = pg_connect(),
db_tables = db_tables_points)
To help provide spatial context, I will overlay the points in each subsequent section on top of the Police Districts Map. I will assign the map data to the police_districts_polygons
variable, which you will see throughout the rest of this post.
police_districts_polygons <-
rpostgis::pgGetGeom(conn = pg_connect(),
name = 'police_district_polygons',
geom = "geometry") %>%
st_as_sf(poly_plot_data)
To simplify some of the processes, I have also created an additional function to print out the initial information and return the simple feature object, as seen in the code block above. The function is called data_exploration_spatial_initial
. The structure is as follows:
i_element <- 1
spatial_data_obj <-
data_exploration_spatial_initial(connection = pg_connect(),
table_name = db_tables_points[i_element],
virtual_data = db_points$tables[[i_element]])
Overview of the Points Datasets
Now I will look at the column names for each of the point-related tables.
DC Police Stations
The police_station_points
table is small with 15 point features. Each police district contains at least 1 police station, with a couple districts having a couple specialized units or substations. The data associated looks as follows:
The output contains basic information about the police stations. Most of the information is contact data. The “NAME” field provides a good description of what each station is, along with specialized liaison units. The “TYPE” field describes whether the location is a station or substation as well. To get the geometry data I can use the following code.
i_element <- 1
point_plot_data <-
data_exploration_spatial_initial(
connection = pg_connect(),
table_name = db_tables_points[i_element],
virtual_data = db_points$tables[[i_element]]
)
# notice the background layer containing the district polygons
ggplot(point_plot_data) +
geom_sf(data = police_districts_polygons,
aes(fill = NAME), inherit.aes = F) +
geom_sf() +
labs(x = 'Lon', y = 'Lat')
DC School Datasets
DC Public Schools
The public_school_points
table contains the locations of 123 public schools throughout the city. The data associated looks as follows:
The output was helpful in learning about the structure and type of schools. Most of the information is contact and location data.
Now I will plot the data and differentiate the point colors by facility usage (FACUSE).
ggplot(point_plot_data) +
geom_sf(data = police_districts_polygons,
fill = 'grey50',
inherit.aes = F) +
geom_sf(aes(color = FACUSE)) +
labs(x = 'Lon', y = 'Lat')
DC Chart Schools
The charter_school_points
table contains information about the 146 chart schools in the city. . The data associated looks as follows:
The output is very similar to what was in the public schools dataset. Now I will plot the data and differentiate the point colors by grade levels (GRADES).
ggplot(point_plot_data) +
geom_sf(data = police_districts_polygons,
fill = 'grey50',
inherit.aes = F) +
geom_sf(aes(color = GRADES)) +
labs(x = 'Lon', y = 'Lat')
As we can see in the image below, charter schools are focused in certain areas, whereas public schools are little more disbursed throughout each Police District. There are also a lot more breakdowns of schools among different groups of grade levels. This likely provides more specificity to particular areas.
DC Summer Public Schools
The summer_public_points
table has 12 school locations in the city. Most of the schools are directed to support elementary level.
Now I will plot the data and differentiate the point colors by facility usage (FACUSE).
ggplot(point_plot_data) +
geom_sf(data = police_districts_polygons,
fill = 'grey50',
inherit.aes = F) +
geom_sf(aes(color = FACUSE)) +
labs(x = 'Lon', y = 'Lat')
The image below shows a sparser set of schools throughout the city, but we have to keep in mind these are summer schools so the demand and offerings are likely limited. During the summer months we can take a look at what is happening near these locations
DC Transformation Schools
The transform_school_points
table has 14 school locations, 1 of which is not used according to the data. The data associated looks as follows:
This set has far fewer fields than the other school-related datasets, though the information provided is certainly sufficient.
ggplot(point_plot_data) +
geom_sf(data = police_districts_polygons,
fill = 'grey50',
inherit.aes = F) +
geom_sf(aes(color = FACUSE)) +
labs(x = 'Lon', y = 'Lat')
This would also be a set to explore what crime occurs in proximity, especially during and pre-/post-school primary hours.
DC Metro Overview
DC Metro Bus Stops
The metro_bus_stop_points
table has 10349 metro bus stop locations. The data associated looks as follows:
This set has a lot of fields, which are coded in both field value and column name. This suggests that I should take a look at the source webpage to learn more about the specifics about the fields. You can view the source metadata information here. You can also find my script for processing the metadata page on my GitHub here. The output of the processing script produces the following CSV:
When reading through the metadata descriptions, there is lot of descriptive information about each stop, surrounding features and accessibility.
ggplot(point_plot_data) +
geom_sf(data = police_districts_polygons,
fill = 'grey50',
inherit.aes = F) +
geom_sf(aes(color = REG_ID)) +
labs(x = 'Lon', y = 'Lat') +
ggtitle(label = 'Metro Bus Stops')
We can see that the data extends well beyond the boundary of the city in the image below.
To reduce the points to only those metro bus stops in the boundary of the city we can apply the code below. First, I get the dc boundary dataset to base the filter on. Since both objects are sf objects, I can run the function sf::st_join
to get only the points that lay inside the polygon boundary.
dc_boundary <-
rpostgis::pgGetGeom(conn = pg_connect(),
name = 'dc_polygon',
geom = "geometry") %>%
st_as_sf()
point_plot_data_filtered <-
point_plot_data %>%
st_join(dc_boundary, left = F)
rm(dc_boundary)
ggplot(point_plot_data_filtered) +
geom_sf(data = police_districts_polygons,
fill = 'grey50',
inherit.aes = F) +
geom_sf(aes(color = REG_ID)) +
labs(x = 'Lon', y = 'Lat') +
ggtitle(label = 'Metro Bus Stops')
We can see in the image below how the focus is now on the city itself. All the metro bus stops within the city boundary are visible. We can see how the bus stops also outline the roadways throughout most of the city.
DC Metro Stations
The metro_station_points
table has 113 locations. The data associated looks as follows:
This set has far fewer fields than the bus stops dataset.
ggplot(point_plot_data) +
geom_sf(data = police_districts_polygons,
fill = 'grey50',
inherit.aes = F) +
geom_sf(aes(color = LINE)) +
labs(x = 'Lon', y = 'Lat') +
ggtitle(label = 'Metro Stations')
Since metro entrances are funnel points and areas with high concentrations of people throughout hours of operation,this will be interesting to explore as they relate with proximity to criminal activity and time of day.
Permits
Building Permits
The building_permits
table has 504782 permits The data associated looks as follows:
This dataset contains temporal and spatial dimensions. It covers permit information from 2009-2020, which is why there are so many records. I will view the records in a crosstab by year and permit type
point_plot_data_reduced <-
point_plot_data %>%
select(ISSUE_DATE, PERMIT_TYPE_NAME, PERMIT_SUBTYPE_NAME,
PERMIT_CATEGORY_NAME, WARD, DISTRICT, PSA) %>%
mutate(year = lubridate::year(ISSUE_DATE)) %>%
filter(APPLICATION_STATUS_NAME == 'PERMIT ISSUED')
point_plot_data_reduced %>%
st_drop_geometry() %>%
select(year, PERMIT_TYPE_NAME) %>%
group_by(year, PERMIT_TYPE_NAME) %>%
summarise(count = n()) %>%
pivot_wider(names_from = PERMIT_TYPE_NAME,
values_from = count)
In the crosstab above we can seen how the permits breakdown by year and category. The supplemental should be further explored using the subtype field.
point_plot_data_reduced %>%
st_drop_geometry() %>%
filter(PERMIT_TYPE_NAME == "SUPPLEMENTAL") %>%
select(year, PERMIT_SUBTYPE_NAME) %>%
group_by(year, PERMIT_SUBTYPE_NAME) %>%
summarise(count = n()) %>%
pivot_wider(names_from = PERMIT_SUBTYPE_NAME,
values_from = count)
Lets look at
point_plot_data_reduced %>%
st_drop_geometry() %>%
filter(PERMIT_TYPE_NAME == "SUPPLEMENTAL") %>%
select(year, PERMIT_SUBTYPE_NAME) %>%
group_by(year, PERMIT_SUBTYPE_NAME) %>%
summarise(count = n()) %>%
pivot_wider(names_from = PERMIT_SUBTYPE_NAME,
values_from = count)
Since there is still a lot of data to plot, I will filter down to the 7th district in 2020 and focus on supplemental permit types and display the permit subtype.
bp_2020_supplemental <-
point_plot_data_reduced %>%
filter(DISTRICT == "SEVENTH",
PERMIT_TYPE_NAME == "SUPPLEMENTAL",
year == 2020)
ggplot(bp_2020_supplemental) +
geom_sf(data = police_districts_polygons %>%
filter(DISTRICT == 7),
inherit.aes = F) +
geom_sf(aes(color = PERMIT_SUBTYPE_NAME)) +
labs(x = 'Lon', y = 'Lat') +
ggtitle(label = 'Building Permits')
ggplot(bp_2020_supplemental) +
geom_sf(data = police_districts_polygons %>%
filter(DISTRICT == 7),
inherit.aes = F) +
geom_sf(aes(color = PERMIT_SUBTYPE_NAME)) +
labs(x = 'Lon', y = 'Lat') +
ggtitle(label = 'Building Permits') +
facet_grid(year~PERMIT_SUBTYPE_NAME)
The in the graphics below we can see where a subset of building permits are taking place. The facet plot helps to separate out the different permit subtypes so you can see them in a more isolated visual. I will spend a little more time looking at aggregates over time on building permits when I cover economic and housing data later on.
Construction Permits
The construction_permits
table has 134003 permits. The data associated looks as follows:
The “EFFECTIVEDATE” will be the date that should be most desirable to use as it should indicate work is starting. The “EXPIRATIONDATE” could be used to look at duration of each project as well. The fields in this dataset that contain “IS{SOMETHING}” should be boolean datatypes. These fields describe the work being permitted. To look at specifics, the field “WORKDETAIL” provides additional information.
Here I can get a quick view of the various status on permits. The status field provides a filtering mechanism to remove those permits which were cancel/withdrawn or denied. I will explore more aggregates over time during a later post.
db_points$tables[[i_element]] %>%
select(STATUS) %>%
group_by(STATUS) %>%
summarise(count = n()) %>%
collect()
Since there is a lot of data over the 2012-2020 timeframe, I’ll display the “ISPSRENTAL” field, which is likely portable storage rental throughout the city. First I need to add a year field before plotting.
# Add a year column
point_plot_data %<>%
mutate(year = lubridate::year(EFFECTIVEDATE))
ggplot() +
geom_sf(data = police_districts_polygons,
inherit.aes = F) +
geom_sf(data = point_plot_data %>% filter(ISPSRENTAL == "T"),
aes(color = as.factor(year))) +
labs(x = 'Lon', y = 'Lat') +
ggtitle(label = 'Construction Permits',
subtitle = "Portable Storage Rental") +
facet_wrap("year", nrow = 2)
Conclusion
Over the course of Part 2a and Part 2b, you should have some understanding of getting the mapping data from the database, working with it in R and how to plot the spatial features on a map. We used the handy library sf
(simple features) to easily plot the spatial objects.
What’s Next?
The next part in the data exploration sub-series will look at the weather, solar and lunar data. This data will not be spatially oriented, but will look at using time-series plotting to help visualize the data.
Additionally, I will produce a more detailed exploration of the Permits datasets in a later post.
Posts in Project Series
- Criminal Analysis: Planning
- Criminal Analysis: Data Search (part 0)
- Criminal Analysis: Data Search (part 1)
- Criminal Analysis: Data Search (part 2)
- Criminal Analysis: Data Search (part 3)
- Criminal Analysis: Data Storage
- Criminal Analysis: Data Storage (part 2)
- Criminal Analysis: Data Search (part 4)
- Derive a Star Schema By Example
- Criminal Analysis: Data Exploration
- Criminal Analysis: Data Exploration (part 1)
- Criminal Analysis: Data Exploration (part 2a)
- Criminal Analysis: Data Exploration (part 2b)
- Criminal Analysis: Data Storage (Part 3)