This post describes an implementation of the star schema using the “Crime” table from my Criminal Analysis project. The original table represents criminal incidents in Washington DC from 2009 through October 2020. The table has 23 columns, 7 of which are spatial grouping categories.
I’ll demonstrate how to decompose the table of data I have, which was likely created from a star schema at its original source, back into a star schema using SQL statements. Although this might not be the best example, it still demonstrates some of the core concepts of facts and dimensions. I will be performing the work in pgAdmin against my PostgreSQL database. The following are my outlined objectives:
- Creating the dimensional tables from the original columns
- Creating a fact table
Background
Original Crime Table
My original data was compiled from each of the years available on the OpenData DC website. Using R, I downloaded each GeoJSON file and then loaded them into the PostgreSQL database I created for my Criminal Analysis project.
In all likelihood, the downloaded data exists in its original data warehouse as a star schema or one of its family relatives (e.g. snowflake, constellation).
For additional background and the operations performed thus far, refer to the following posts:
Example Data
To simplify my data, I reduced the crime table to just the spatial grouping fields (Ward, District, PSA, etc.), seven fields total. This should provide a sufficient amount of data and variety to explain and demonstrate the concepts. If you have followed the previous posts and have the crime table in your database, you can simply run the following SQL statement to create a new table called dc_spatial_ref:
-- Create dc_spatial_ref table.
SELECT DISTINCT
"WARD" as ward,
"PSA" as psa,
"DISTRICT" as district,
"ANC" as anc,
"VOTING_PRECINCT" as voting_precinct,
"NEIGHBORHOOD_CLUSTER" as neighborhood_cluster,
"CENSUS_TRACT" as census_tract
INTO TABLE dc_spatial_ref
FROM crime
ORDER BY "WARD";
What is Dimensional Modeling?
Dimensional modeling is a methodology for designing data warehouse structures using the concepts of facts and dimensions to describe the data. Facts take the form of numeric values that serve as keys to their definitions, which are held in dimensions. The facts for each subject area of the data model are collected into a fact table, and each dimension is defined in its own table. For example, a fact row might store a ward key of 3, while the ward dimension table defines what that key refers to.
If I were creating the model from scratch, I would use the business model to define and map out the logical and/or physical model. The information would be defined by business needs and by what is currently being collected throughout the business.
For the purpose of this example, I’ll demonstrate creating a star schema from the data I have using SQL statements. The data I am using is the end result of queries and joins that generated the tables of data for analysis. I’m reverse engineering the data to demonstrate the underlying structure.
What is a Star Schema?
The star schema is a denormalized design consisting of a fact table surrounded by dimension tables. The arrangement of dimension tables around the fact table, depicted in an entity-relationship diagram, gives the appearance of a star.
The goal of the star schema is a simple structure focused on high-performing data retrieval. It is used by many business intelligence tools and applications.
The schema design is easy to understand because it is less abstract than what you would see in Third Normal Form; it focuses on the needs of business intelligence users and analysts. By their very nature, star schemas are not normalized.
The typical modeling process has four steps for creating the denormalized schema:
- Choose Business Objective
- Identify Granularity
- Identify Dimension and its Attributes
- Identify Fact
Derivation Example Note
Because of the nature of this derivation, I have not gone through these steps explicitly, but you can see that I have performed most of them in reverse order.
Creating the Look-up Tables
First, to create the dimension tables (look-up tables), I need to make each attribute in my newly created dc_spatial_ref table into its own unique table. Below you can see the SQL statements I use to perform the task across my entire table.
SQL Statements
The following template is what I used to create an index table for a given column and then insert the data from my original table into it.
First, I drop the table if it exists. I only have this here because I was experimenting. Once a table is defined, redefining it is generally not recommended; doing so could have serious cascading effects on all other tables that reference the original primary key. For this post, I wanted to make sure I tested everything before committing, and to avoid providing bad information.
-- Template for creating lookup tables per column
DROP TABLE IF EXISTS [table_name]_index;
Next I create the table, calling it “[table_name]_index”. For a simple example, let’s define [table_name]=ward. The original column name in my “crime” table is “WARD”. I am creating ward_index as my table with two columns, ward_id and ward. The ward_id will serve as the primary key for this table. If later there is an additional ward, I can insert the new value into my table and the ward_id will increment by 1. Similarly, if there is information about the dimension that I can add (extend), then I can alter the table without affecting the fact table. The modification expands what I can query.
-- CREATE TABLE Template for each column that is going to become a lookup table.
CREATE TABLE [table_name]_index (
[column_name]_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
[column_name] VARCHAR --(or INT)
);
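As a side note, extending a dimension later only requires an ALTER TABLE on the look-up table. For example, using a hypothetical ward_name attribute that is not part of this dataset:
-- Hypothetical extension: add a descriptive attribute to the ward dimension.
-- The fact table is unaffected; ward_name becomes reachable through ward_id.
ALTER TABLE ward_index ADD COLUMN ward_name VARCHAR;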
Finally, I populate the new table with the distinct values from dc_spatial_ref. Since dc_spatial_ref already uses the new lowercase column names, the quoted original names from the crime table are not needed here.
-- Insert the data into the new table from a Select query.
INSERT INTO [table_name]_index ([column_name])
SELECT DISTINCT [column_name]
FROM dc_spatial_ref
ORDER BY [column_name];
Putting it all together
Once I had my template structure, I threw the old column names, the new column names, and the template formula into a spreadsheet to quickly write all my SQL statements. The following image shows what it looked like:
I used the following formula in LibreOffice Calc (it works the same way in MS Excel) to generalize the process.
MS Excel / LibreOffice Calc
="DROP TABLE IF EXISTS "&B2&"_index; CREATE TABLE "&B2&"_index ("&B2&"_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, "&B2&" VARCHAR); INSERT INTO "&B2&"_index ("&B2&") SELECT DISTINCT "&B2&" FROM dc_spatial_ref ORDER BY "&B2&";"
You can use a similar template to create the remaining SQL statements as needed.
Execute the final SQL statement
The template allows me to copy and paste the final evaluated text into the Query Tool in pgAdmin.
Output
Now, all my tables with distinct values and unique identifiers have been created. I can spot check as needed after refreshing my table list. In this instance I’m showing you my ward_index table as validation.
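The same spot check can be done from the Query Tool instead of the table browser (a minimal sketch; your ids and values will differ):
-- Spot check: list the distinct ward values and their generated ids.
SELECT ward_id, ward
FROM ward_index
ORDER BY ward_id;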
Creating a Fact Table
Now that I have created all the dimension tables (think look-up tables) for each of the spatial grouping features, I will create a fact table using the primary keys from my dimension tables as foreign keys for each attribute (column) in the fact table. I will also assign a primary key that can then be referenced in an alternative version of my crime table.
Part 1
SQL Statements
First, I’ll construct the query so I can see the original and the referenced index attributes side by side. I like to visually validate before I execute a query that creates or modifies anything; it provides one last check before committing to something.
SELECT
dc_spatial_ref.*,
ward_index.ward_id,
psa_index.psa_id,
district_index.district_id,
anc_index.anc_id,
voting_precinct_index.voting_precinct_id,
neighborhood_cluster_index.neighborhood_cluster_id,
census_tract_index.census_tract_id
FROM dc_spatial_ref
LEFT JOIN ward_index ON dc_spatial_ref.ward = ward_index.ward
LEFT JOIN psa_index ON dc_spatial_ref.psa = psa_index.psa
LEFT JOIN district_index ON dc_spatial_ref.district = district_index.district
LEFT JOIN anc_index ON dc_spatial_ref.anc = anc_index.anc
LEFT JOIN voting_precinct_index ON dc_spatial_ref.voting_precinct = voting_precinct_index.voting_precinct
LEFT JOIN neighborhood_cluster_index ON dc_spatial_ref.neighborhood_cluster = neighborhood_cluster_index.neighborhood_cluster
LEFT JOIN census_tract_index ON dc_spatial_ref.census_tract = census_tract_index.census_tract;
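Before creating anything, it is also worth confirming that none of the LEFT JOINs came up empty. One quick way (a sketch reusing the same joins; NULLs in the source columns will surface here as well) is to count rows where any id failed to resolve:
-- Sanity check: count source rows where any dimension key is unresolved.
SELECT COUNT(*) AS unmatched_rows
FROM dc_spatial_ref
LEFT JOIN ward_index ON dc_spatial_ref.ward = ward_index.ward
LEFT JOIN psa_index ON dc_spatial_ref.psa = psa_index.psa
LEFT JOIN district_index ON dc_spatial_ref.district = district_index.district
LEFT JOIN anc_index ON dc_spatial_ref.anc = anc_index.anc
LEFT JOIN voting_precinct_index ON dc_spatial_ref.voting_precinct = voting_precinct_index.voting_precinct
LEFT JOIN neighborhood_cluster_index ON dc_spatial_ref.neighborhood_cluster = neighborhood_cluster_index.neighborhood_cluster
LEFT JOIN census_tract_index ON dc_spatial_ref.census_tract = census_tract_index.census_tract
WHERE ward_index.ward_id IS NULL
OR psa_index.psa_id IS NULL
OR district_index.district_id IS NULL
OR anc_index.anc_id IS NULL
OR voting_precinct_index.voting_precinct_id IS NULL
OR neighborhood_cluster_index.neighborhood_cluster_id IS NULL
OR census_tract_index.census_tract_id IS NULL;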
The final SQL statement selects the index fields into a table called dc_spatial_fact_table:
SELECT
ward_index.ward_id,
psa_index.psa_id,
district_index.district_id,
anc_index.anc_id,
voting_precinct_index.voting_precinct_id,
neighborhood_cluster_index.neighborhood_cluster_id,
census_tract_index.census_tract_id
INTO TABLE dc_spatial_fact_table
FROM dc_spatial_ref
LEFT JOIN ward_index ON dc_spatial_ref.ward = ward_index.ward
LEFT JOIN psa_index ON dc_spatial_ref.psa = psa_index.psa
LEFT JOIN district_index ON dc_spatial_ref.district = district_index.district
LEFT JOIN anc_index ON dc_spatial_ref.anc = anc_index.anc
LEFT JOIN voting_precinct_index ON dc_spatial_ref.voting_precinct = voting_precinct_index.voting_precinct
LEFT JOIN neighborhood_cluster_index ON dc_spatial_ref.neighborhood_cluster = neighborhood_cluster_index.neighborhood_cluster
LEFT JOIN census_tract_index ON dc_spatial_ref.census_tract = census_tract_index.census_tract;
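A quick row-count comparison (a minimal sketch) verifies that the fact table captured every row of dc_spatial_ref:
-- The fact table should hold exactly one row per dc_spatial_ref row.
SELECT
(SELECT COUNT(*) FROM dc_spatial_ref) AS source_rows,
(SELECT COUNT(*) FROM dc_spatial_fact_table) AS fact_rows;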
Part 2
Alter Table
Next, I need to alter the newly created table to include the id column, which will be the primary key. Then I’ll alter the remaining columns to become foreign keys.
ALTER TABLE dc_spatial_fact_table ADD id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY;
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (ward_id) REFERENCES ward_index(ward_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (psa_id) REFERENCES psa_index(psa_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (district_id) REFERENCES district_index(district_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (anc_id) REFERENCES anc_index(anc_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (voting_precinct_id) REFERENCES voting_precinct_index(voting_precinct_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (neighborhood_cluster_id) REFERENCES neighborhood_cluster_index(neighborhood_cluster_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (census_tract_id) REFERENCES census_tract_index(census_tract_id);
Below you can see my input and output from the SQL statement in pgAdmin.
Output
Here is the resultant fact table.
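The payoff is that readable values can be reassembled on demand by joining the fact table back out to its dimensions. Here is a sketch using two of the seven dimensions; the remaining joins follow the same pattern:
-- Rebuild readable values from the fact table via its dimension tables.
SELECT
f.id,
w.ward,
p.psa
FROM dc_spatial_fact_table f
LEFT JOIN ward_index w ON f.ward_id = w.ward_id
LEFT JOIN psa_index p ON f.psa_id = p.psa_id
ORDER BY f.id;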
What’s Next?
After going through the example above, I could apply the concept to my full criminal activity analysis project data. The resultant entity-relationship diagram could look something like this:
Each reference table would have all the necessary attributes with a primary key. The crime fact table would then have its own unique primary key, with each dimension referenced per attribute using their primary keys as foreign keys. The full model might resemble a snowflake schema or a constellation schema when it’s all built out, though; the core of each of those is the star schema base.
Using this project model, the dc_spatial_ref table that this post initially created would be the dimension table with an appended primary key id attribute.
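As a sketch of where that could go, the future crime fact table might be declared along these lines. Everything beyond the spatial keys is a hypothetical stand-in for the remaining attributes of the full crime table:
-- Sketch of a possible crime fact table (offense_id and report_date are hypothetical).
CREATE TABLE crime_fact (
crime_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
ward_id INT REFERENCES ward_index(ward_id),
psa_id INT REFERENCES psa_index(psa_id),
district_id INT REFERENCES district_index(district_id),
anc_id INT REFERENCES anc_index(anc_id),
voting_precinct_id INT REFERENCES voting_precinct_index(voting_precinct_id),
neighborhood_cluster_id INT REFERENCES neighborhood_cluster_index(neighborhood_cluster_id),
census_tract_id INT REFERENCES census_tract_index(census_tract_id),
offense_id INT, -- hypothetical: would reference an offense_index dimension
report_date TIMESTAMP -- hypothetical: could reference a date dimension instead
);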
SQL and Database Resources
If you have not learned much about SQL and databases, I would highly recommend setting up an account at HackerRank.com. It is free and provides good problems to solve, which builds experience. Work through their SQL and Database problems. SQL is a very valuable skill and one Data Scientists are often expected to have. If you are interested in database problems, they have those too. The image below shows all the skills available to learn on their platform.
For referencing or looking up commands, syntax, and examples, use W3Schools.com and tutorialspoint.com.
Database Resources
For official information and documentation on PostgreSQL check out their website and documentation at https://www.postgresql.org/docs/13/index.html.
For the official pgAdmin documentation, please refer to this: https://www.pgadmin.org/docs/pgadmin4/4.28/index.html
There are all sorts of online resources and lots of books out there to help educate you on more database skills.