*** Updated Code on GitHub ***
In this post I will be covering spatial visualizations using data from the state of Wyoming. For structure sake, I will be walking through each of the step outlined in my Analytic Methodology workflow. We will identify the sources of data, the processes to analyze and visualize the data.
For several weeks I have been following Cowboy State Daily, a local news and reporting outlet in the state of Wyoming. I have been observing their reporting of the COVID-19 and thought I could help create an alternative map that they use as well as some other visualizations that could accompany future reporting.
The purpose of this post is not to criticize the original source, but to explore how to build up a spatial visualization from the ground up. Once the foundation of the visualization is composed, we can easily add more content and data to help convey spatial information in a visually appealing manner.
Their most recent report on 17-July-2020 is here: https://cowboystatedaily.com/2020/07/16/wyoming-sees-39-new-confirmed-coronavirus-cases-recoveries-grow-by-34/
This is their current visual:
Planning
When I look at the image above, I can see what foundational data we will need to create the underlying map. First we need to spatial reference data for the state of Wyoming with specificity to the County boundaries. This could entail looking for shapefiles, geoJSON, KML or csv files with the appropriate details.
Next, we will need to get some data to give the visualization some life and meaning. For this example, I have extracted the numbers from the article mentioned above. The fidelity of the data in terms of time and space are the only limitations to what we could create to tell a more contextualized story.
Data Search
After some querying around for “Wyoming”, “county”, “spatial data” and “shapefiles” I found several a useful resources at the following:
- http://www.geospatialhub.org/
- https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
- http://www.uwyo.edu/wygisc/geodata/index.html
- https://catalog.data.gov/dataset/county-boundaries-wyoming-county-shapefile-1998
I have initially explored https://www.geospatialhub.org/ for this particular example, but this list of sources provide additional data that could also be used. Geospatial Hub is a repository with various spatial data for Wyoming and is publicly available. Under the “Administration and Political Boundaries” section, i was able to find what I needed here:
https://www.geospatialhub.org/datasets/b0e0a99ec14748eeae750949c7bbb2ec_0?geometry=-123.671%2C40.158%2C-91.437%2C45.779
If we wanted to analysis and visualize other layers of data, this is where we would search for and explore what is available at whatever spatial granularity we could find.
Data Storage
Looking at the page shown above we have the option to download the data or get the API. The following menus show you the various options. For my example, I downloaded the “Full Dataset Shapefile” and extracted the zip file to a folder on my local storage device.
Inside the folder you should see the following if you are performing the same actions.
Data Preparation
Now that we have our data, we are going to explore how to interact with the shapefile in R.
First we load the various packages I think we will need for the remainder of this tutorial. If you dont have these libraries, you can install them using the install.packages("packagename")
command.
library(tidyverse)
library(magrittr)
library(rgdal)
library(rgeos)
library(maptools)
library(geosphere)
library(GISTools)
We will now read in the desired shapefile “DOR_-_Counties.shp”.
wyoming <- readShapeSpatial(fn = "./ProblemXSolutions.com/Wyoming/wyo_counties/DOR_-_Counties.shp")
Data Exploration
Once the data is loaded, we should explore the structure. Using the slotNames(wyoming)
function, we can see the high-level structure of the object. In this case the object wyoming
contains 5 slots. Using the str(wyoming)
function, we can see that the structure of wyoming
.
The first slot is data and is a formally classed a SpatialPolygonsDataFrame. We can access this by using wyoming@data
. This dataframe provides metadata information about the “@polygon” slot spatial features. Each of the 23 observation correspond to the 23 polygons, which represent each of the counties in the state of Wyoming. The names(wyoming)
provides use the variable names of the “@data” table.
slotNames(wyoming)
# [1] "data" "polygons" "plotOrder" "bbox" "proj4string"
str(wyoming)
# Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
# ..@ data :'data.frame': 23 obs. of 8 variables:
# .. ..$ OBJECTID : int [1:23] 1 2 3 4 5 6 7 8 9 10 ...
# .. ..$ Shape_Leng: num [1:23] 661865 467223 569470 947920 718621 ...
# .. ..$ Shape_Area: num [1:23] 2.60e+10 1.23e+10 1.30e+10 4.87e+10 2.00e+10 ...
# .. ..$ COUNTYNAME: Factor w/ 23 levels "Albany","Big Horn",..: 13 11 17 19 1 4 8 16 2 7 ...
# .. ..$ YEARCOLLEC: int [1:23] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
# .. ..$ TAXYEAR : int [1:23] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
# .. ..$ TYPENAME : Factor w/ 1 level "County Boundary": 1 1 1 1 1 1 1 1 1 1 ...
# .. ..$ RuleID : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
# .. ..- attr(*, "data_types")= chr [1:8] "N" "N" "N" "C" ...
# ..@ polygons :List of 23
# .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots
# LONG READ OUT
names(wyoming)
# [1] "OBJECTID" "Shape_Leng" "Shape_Area" "COUNTYNAME" "YEARCOLLEC" "TAXYEAR" "TYPENAME" "RuleID"
We can check out the other slots of the wyoming
object to understand what other information is available. The wyoming@plotOrder shows us the order to plot the polygons, but this isn’t too important for us (at least initially speaking). The wyoming@bbox
gives us the bounding box of the overall spatial object. Looking at these numbers, I can tell these are not in a coordinate system I’m familiar with and is not Latitude and Longitude (degrees, minutes, seconds or decimal degrees). Checking the wyoming@proj4string
we might glean some information about its projection. Unfortunately there is no information. After checking the website, I was not readily able to identify the coordinate system used, but we should be able to plot the data regardless.
wyoming@plotOrder
# [1] 4 10 6 11 1 17 23 22 16 13 5 12 9 18 3 14 2 21 20 7 15 8 19
wyoming@bbox
# min max
# x -12362645 -11583006
# y 5011573 5622428
wyoming@proj4string
# CRS arguments: NA
If we needed to convert the coordinates we would need to find the coordinate system used and determine which coordinate system we wanted them to be in. From that point we could convert the data and continue on.
Data Transformation
In this section we will begin transforming the data to get it ready for visualizing.
wyoming@data$id <- rownames(wyoming@data)
wyoming.points <- fortify(wyoming)
# long lat order hole piece id group
# 1 -11971610 5275121 1 FALSE 1 0 0.1
# 2 -11971607 5275669 2 FALSE 1 0 0.1
# 3 -11971604 5276216 3 FALSE 1 0 0.1
# 4 -11971601 5276763 4 FALSE 1 0 0.1
Now we merger the two objects.
wyoming.df <-
left_join(x = wyoming.points,
y = wyoming@data,
by = "id")
head(wyoming.df)
# long lat order hole piece id group OBJECTID Shape_Leng Shape_Area COUNTYNAME YEARCOLLEC TAXYEAR TYPENAME RuleID
# 1 -11971610 5275121 1 FALSE 1 0 0.1 1 661865.3 26008733357 Natrona 2013 2014 County Boundary <NA>
# 2 -11971607 5275669 2 FALSE 1 0 0.1 1 661865.3 26008733357 Natrona 2013 2014 County Boundary <NA>
# 3 -11971604 5276216 3 FALSE 1 0 0.1 1 661865.3 26008733357 Natrona 2013 2014 County Boundary <NA>
# 4 -11971601 5276763 4 FALSE 1 0 0.1 1 661865.3 26008733357 Natrona 2013 2014 County Boundary <NA>
# 5 -11971601 5277270 5 FALSE 1 0 0.1 1 661865.3 26008733357 Natrona 2013 2014 County Boundary <NA>
At this point we can simply plot the data to see what it initially looks like so we can identify what we would like to do next.
ggplot(data = wyoming.df,
aes(x = long, y = lat, group = COUNTYNAME, fill =COUNTYNAME)) +
geom_polygon() +
geom_path(color = 'white') +
coord_equal() +
labs(x = 'Longitude', y = 'Latitude',
title = 'County Map of Wyoming') +
theme(panel.background = element_rect(fill = 'gray20'),
panel.border = element_rect(linetype = 'solid', fill = NA),
panel.spacing = unit(0.2, 'lines'),
strip.text = element_text(),
strip.background = element_rect(linetype = 'solid', color = 'black'),
axis.text = element_text(color = 'black'),
axis.ticks.length = unit(0, "cm"))
We can see from the plot that everything was plotted as we would hope. Now we want bring it closer to our initial goal, by putting the names of the counties in each of the Counties.
Label the Counties
The overall state is pretty square. Some of the counties are somewhat square. But some of the counties are irregular. We will explore different ways of applying a label to each of the counties. Ideally, we would want the label to be somewhat center on each polygon, while also not overlapping into a neighboring county.
Average Lat/Long Values
First, we will look at taking the average value along the Latitude and the Longitudinal axis.
cog_df <-
wyoming.df %>%
group_by(id, COUNTYNAME) %>%
summarise(long = mean(long),
lat = mean(lat))
# # A tibble: 23 x 4
# # Groups: id [23]
# id COUNTYNAME long lat
# <chr> <fct> <dbl> <dbl>
# 1 0 Natrona -11890980. 5290483.
# 2 1 Laramie -11661355. 5067093.
# 3 10 Park -12220736. 5487026.
# 4 11 Lincoln -12306601. 5217259.
I assigned the first plot to a variable wy_plot
to reduce repeating the same plotting code for each successive iteration. In this next portion of code, I’m removing the fill legend since the labels will be on the plot, making the legend unnecessary. We are also adding the annotate
function of ggplot2
library to apply the labels.
wy_plot +
guides(fill=FALSE) +
annotate("text",
x = cog_df$long,
y = cog_df$lat,
label = cog_df$COUNTYNAME,
size = 4)
We can see from the plot that this method works for most counties due to their relative squareness, but are very skewed on the irregularly shaped counties. This is caused by the density of points inherent in the details of certain boundary areas. The more points in a concentrated area, the more pull in that direction when we take the average along each axis.
Explore Other Methods on an Irregular Polygon
Now lets take an irregular polygon county, “Park” and explore how we could center this without manually having to adjust. In program the more that can be coded, the less time we spend manually adjusting one-offs. We are looking for the fastest and least resistance method, that way the same method can be applied to a multitude of variations and perform the way we would like each time.
Here are four methods. The first
sample_cog0 <-
sample_county %>%
group_by(id, COUNTYNAME) %>%
summarise(long = mean(long), lat = mean(lat))
sample_cog1 <-
sample_county %>%
group_by(id, COUNTYNAME) %>%
summarise(long = mean(c(max(long),min(long))),
lat = mean(c(max(lat),min(lat))))
sample_cog2 <-
wyoming@polygons[[11]]@labpt %>%
t() %>%
as.data.frame() %>%
mutate('id' = '10', 'COUNTYNAME' = 'Park') %>%
rename('long'='V1', 'lat'='V2') %>%
select(3,4,1,2)
sample_cog3 <-
wyoming%>%
gCentroid(.,byid=TRUE) %>%
as.data.frame() %>%
slice(11) %>%
mutate('id' = '10', 'COUNTYNAME' = 'Park') %>%
rename('long'='x', 'lat'='y') %>%
select(3,4,1,2)
sample_cog_df <- bind_rows(sample_cog0,
sample_cog1,
sample_cog2,
sample_cog3)
sample_cog_df$Method <- row.names(sample_cog_df)
From these methods we will be able to easily see where each would be located on the plot. The following image provide the corresponding example.
ggplot(data = sample_county,
aes(x = long, y = lat, group = COUNTYNAME, fill =COUNTYNAME)) +
geom_polygon() +
geom_path(color = 'white') +
coord_equal() +
guides(fill=FALSE) +
annotate("text",
x = sample_cog_df$long,
y = sample_cog_df$lat,
label = sample_cog_df$Method,
size = 4)
From this we can see that method 3 and 4 are the same. Method 3 is organic in the wyoming
object structure so there is less processing that would have to get done. However, if the polygon object does not have this feature, then the 4th method would be valid to use.
Conclusion
Using Method 3 (wyoming@polygons[[11]]@labpt
) we can generate the following visualization for the Counties in the state of Wyoming. In order to get Method 3 for the whole wyoming
object, we can accomplish that using the following snippet of code.
# This is a good function to extract all the labpt data from the object
county_labpt <-
lapply(X = 1:length(wyoming@polygons),
FUN = function(i) {
data.frame('id' = wyoming@polygons[[i]]@ID,
'long' = wyoming@polygons[[i]]@labpt[1],
'lat' = wyoming@polygons[[i]]@labpt[2])}) %>%
bind_rows() %>%
as.tibble()
county_labpt %<>%
left_join(x = .,
y = wyoming@data,
by = 'id')
The code to generate the visualization.
wy_plot +
guides(fill=FALSE) +
annotate("text",
x = county_labpt$long,
y = county_labpt$lat,
label = county_labpt$COUNTYNAME,
size = 4)
After inspection, if you see a result that doesn’t quite meet your expectation you could edit the appropriate values then rerun the plot code. If you will want to write off the results to a csv for future use and factor that into your script.
# If you wanted to manually adjust locations after this, you could
# edit the value and save off to a csv for future use.
county_labpt %>% edit()
county_labpt %>%
write_csv(., path = "./ProblemXSolutions.com/Wyoming/wyo_cog.csv")
GitHub Link
- https://github.com/problemxsolutions/wyoming-r/blob/master/wyo_county_processing_shapefile.R
- https://github.com/problemxsolutions/wyoming-r/blob/master/wyo_county_processing_shapefile_sf.R
What’s Next?
Now that we have created our spatial visualization using a shapefile, in the next post, we will continue the current example by applying the COVID data provided by the source mentioned above.