***This was created and valid in 2016. Data for this specific data project may no longer be current***
Data Transformation
The transformation stage is where you develop, add, or enhance the data you have to suit your follow-on analysis and visualization requirements. You may revisit the transformation stage many times throughout a project as you identify characteristics or features that could help a particular analysis.
Transform Data Fields
For this particular data set, we do not necessarily need to transform the basic textual data fields. As we saw when reviewing the unique values in the preparation stage, the values in the “Shift”, “Offense”, and “Method” columns appear to be just fine. If there were many distinct values, we might consider aggregating them into categories to help summarize the data, since dealing with many unique values can make the analysis and visualization stages more complicated. If aggregating values into categories detracts from the overall meaning or focus of your analysis, however, it would not be advised. Read the attribute descriptions if your data provides them; the descriptions may help you determine which values align most closely if you do decide to aggregate and categorize.
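To make that concrete, here is a minimal sketch of what aggregating values into categories could look like. The "Property"/"Violent" groupings and the OFFENSE_CAT column are hypothetical and purely illustrative of the cross-referencing technique; this data set has few enough values that we leave them as they are.
# Hypothetical groupings for illustration only
offense_category <- c("ARSON"                      = "Property",
                      "BURGLARY"                   = "Property",
                      "MOTOR VEHICLE THEFT"        = "Property",
                      "THEFT F/AUTO"               = "Property",
                      "THEFT/OTHER"                = "Property",
                      "ASSAULT W/DANGEROUS WEAPON" = "Violent",
                      "HOMICIDE"                   = "Violent",
                      "ROBBERY"                    = "Violent",
                      "SEX ABUSE"                  = "Violent")
# Index the named vector by the raw values to derive a category per record
data_2011_2015$OFFENSE_CAT <- offense_category[data_2011_2015$OFFENSE]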
Sometimes when you go to visualize a variable, the label values can be too long, causing the labels to overlap. In this instance you may want to add new lines to the values used in the labels. To do this, we gauge how much text should appear on each line when it plots. Looking at the distinct values of the “Offense” variable, we can see that “MOTOR VEHICLE THEFT” and “ASSAULT W/DANGEROUS WEAPON” are much longer than the other values. For those values, we will add a line break (\n) in the positions that will make the labels narrower. We can also spell out “W/” and “F/” to become “With” and “From”; this was spelled out in the attribute definitions on the data overview page. In OffenseVect below, I tailored each of the values for aesthetic purposes. We could add this data to the data frame, or we could leave the values as separate vectors for plotting purposes only.
unique(data_2011_2015$SHIFT)
# [1] "MIDNIGHT" "DAY" "EVENING"
unique(data_2011_2015$OFFENSE)
# [1] "SEX ABUSE" "HOMICIDE" "BURGLARY" "MOTOR VEHICLE THEFT" "ROBBERY" "THEFT F/AUTO"
# [7] "ASSAULT W/DANGEROUS WEAPON" "THEFT/OTHER" "ARSON"
unique(data_2011_2015$METHOD)
# [1] "OTHERS" "GUN" "KNIFE"
# Creating the needed cross-referenced variables
OffenseVect<-c("ARSON"="Arson",
"ASSAULT W/DANGEROUS WEAPON"="Assault\n(Dangerous\nWeapon)",
"BURGLARY"="Burglary","HOMICIDE"="Homicide",
"MOTOR VEHICLE THEFT"="Motor\n Vehicle\nTheft",
"ROBBERY"="Robbery",
"SEX ABUSE"="Sex\nAbuse",
"THEFT F/AUTO"="Theft\n(From Auto)",
"THEFT/OTHER"="Theft\n(Other)")
OffenseVect_alt<-c("Arson","Assault\n(Dangerous\nWeapon)","Burglary","Homicide","Motor\n Vehicle\nTheft",
"Robbery","Sex\nAbuse","Theft\n(From Auto)","Theft\n(Other)")
Enhance Temporal Fields
The date/time fields are formatted as mm/dd/yyyy hh:mm:ss AM/PM. R needs to be told how to read this; otherwise we may not get the output we want from a date/time field. For data_2011_2015$REPORTDATETIME, the appropriate format string is "%m/%d/%Y %I:%M:%S %p", which describes the order in which to read the components. You can look up the formatting codes for dates and times in the strptime() function documentation.
# These are the first 6 values in the field
head(data_2011_2015$REPORTDATETIME)
# This is how we would convert the character values into date/time values
head(strptime(data_2011_2015$REPORTDATETIME, format = "%m/%d/%Y %I:%M:%S %p"))
head(strptime(data_2011_2015$REPORTDATETIME, format = "%m/%d/%Y %r"))
# When we are satisfied, we can commit the conversion into a new variable so we can always go back to the source values as they were received.
data_2011_2015$REPORTDATETIME_DTG <- strptime(data_2011_2015$REPORTDATETIME, format = "%m/%d/%Y %r")
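# A note on types (sketch): strptime() returns a POSIXlt object. If you prefer a
# column backed by a plain numeric vector, as.POSIXct() converts it, and the
# format() calls below work the same on both:
# data_2011_2015$REPORTDATETIME_DTG <- as.POSIXct(data_2011_2015$REPORTDATETIME_DTG)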
# Derive Year
data_2011_2015$REPORTDATETIME_yr <- format(data_2011_2015$REPORTDATETIME_DTG, "%Y")
# Derive Year-Month
data_2011_2015$REPORTDATETIME_yrmn <- format(data_2011_2015$REPORTDATETIME_DTG, "%Y-%m")
# Derive Month (%B will give the full month name, %b will give the abbreviated name, %m will give the numeric value)
data_2011_2015$REPORTDATETIME_month <- format(data_2011_2015$REPORTDATETIME_DTG, "%B")
# Derive Day of Week (Sunday - Saturday) (%w will give the numeric day 0-6,
# %u is week day 1-7, with Monday=1.
# %A is the full weekday name,
# %a is the abbreviated weekday name)
data_2011_2015$REPORTDATETIME_dow <- format(data_2011_2015$REPORTDATETIME_DTG, "%A")
# Derive Week Number (%U is 00-53 using Sunday as the first day of the week,
# %V is the week of the year defined by ISO 8601,
# %W is 00-53 using Monday as the first day of the week)
data_2011_2015$REPORTDATETIME_wknum <- format(data_2011_2015$REPORTDATETIME_DTG, "%U")
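# Quick illustration of how the three conventions can disagree (a sketch;
# %V output can vary by platform). 2015-01-01 fell on a Thursday:
format(as.Date("2015-01-01"), "%U")  # "00" (Sunday-based)
format(as.Date("2015-01-01"), "%W")  # "00" (Monday-based)
format(as.Date("2015-01-01"), "%V")  # "01" (ISO 8601)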
# Derive Hour of Day
data_2011_2015$REPORTDATETIME_hour <- format(data_2011_2015$REPORTDATETIME_DTG, "%H")
# Derive AM/PM
data_2011_2015$REPORTDATETIME_ampm <- format(data_2011_2015$REPORTDATETIME_DTG, "%p")
# Derive just the date
data_2011_2015$REPORTDATETIME_date <- format(data_2011_2015$REPORTDATETIME_DTG, "%m/%d/%Y")
# Derive the date without the year
data_2011_2015$REPORTDATETIME_woyr <- format(data_2011_2015$REPORTDATETIME_DTG, "%m/%d")
#
# Creating the needed cross-referenced variables
monthLabels <- c('1' = 'JAN', '2' = 'FEB', '3' = 'MAR', '4' = 'APR', '5' = 'MAY', '6' = 'JUN',
                 '7' = 'JUL', '8' = 'AUG', '9' = 'SEP', '10' = 'OCT', '11' = 'NOV', '12' = 'DEC')
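One caveat when cross-referencing this vector (a sketch; the REPORTDATETIME_mlab column name is hypothetical): %m returns zero-padded strings such as "01", so pass the value through an integer to match the unpadded names above.
# %m yields "01"..."12"; convert through integer to match the names '1'...'12'
month_num <- as.integer(format(data_2011_2015$REPORTDATETIME_DTG, "%m"))
data_2011_2015$REPORTDATETIME_mlab <- monthLabels[as.character(month_num)]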
Transform Spatial Fields
This is the data transformation of the coordinates. During the preparation phase, I identified that the coordinates provided in the data were not latitude and longitude coordinates the way we usually think of them. The coordinates in this data set are expressed in the Maryland State Plane (MSP) system. In order to transform them into true decimal-degree latitude and longitude, we need to know how to reproject MSP coordinates into a standard geographic coordinate system. This transformation will allow us to conduct follow-on spatial analysis and visualization of the data appropriately. If we were not planning to visualize the data on a map, this step might not be necessary.
It took some searching to find how to convert the MSP coordinates into ordinary decimal-degree lat/lon coordinates. Some of my queries were {“maryland state plane”, “latitude”, “longitude”, “decimal degrees”}. Eventually I found the Maryland Department of Natural Resources site and a Stack Overflow post. Both helped my understanding of the problem and provided solutions for conducting the appropriate spatial transformation, in R at that.
I will also note that I did not find each of those sites right away; it took modifying my queries over and over, as well as combing through sites and posts, to get what I needed. This type of trench work is necessary because you are not always going to find what you need immediately. Sometimes you get bits and pieces from other people, but not exactly what you need. From there, you have to know how to keep looking given the clues you find along the way.
# Convert the Maryland State Plane coordinates provided in the data
# (the sp and rgdal packages supply the projection machinery)
library(sp)
library(rgdal)
nad83_coords <- data.frame(x = data_2011_2015$BLOCKXCOORD, y = data_2011_2015$BLOCKYCOORD)
coordinates(nad83_coords) <- c('x', 'y')
proj4string(nad83_coords) <- CRS("+init=esri:102285")
# Transform the spatial coordinates into decimal-degree lat/lon (EPSG:4326)
ConvertedCoords <- spTransform(nad83_coords, CRS("+init=epsg:4326"))
# Tidy up the results
ConvertedCoords <- as.data.frame(ConvertedCoords)
colnames(ConvertedCoords) <- c('long', 'lat')
# Load the results back into the dataset
data_2011_2015 <- cbind(data_2011_2015, ConvertedCoords)
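As a quick sanity check (a sketch, using approximate bounds for the District), the converted values should land in the expected decimal-degree range:
# DC sits roughly between 38.79 and 39.00 latitude, -77.12 and -76.91 longitude
summary(data_2011_2015$lat)
summary(data_2011_2015$long)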
Conclusion
Now that we have completed the necessary transformations and data enhancements, we can proceed to the exploration stage. Moving past this stage does not mean we will never come back to it and add more information. Some of this only comes out when we try to analyze something and realize that what we really need has to be teased out of the data. The more experience you gain, the earlier you will recognize what you need. It's all part of the feedback loop we all have in the learning process. This also highlights the importance of documenting what you do along the way, because you can build on what you have done previously.