This post describes an implementation of the star schema using the “Crime” table from my Criminal Analysis project. The original table represents criminal incidents in Washington DC from 2009 through October 2020. The table has 23 columns, 7 of which are spatial grouping categories.
I’ll demonstrate how to decompose the table of data I have, which was likely created from a star schema at its original source, back into a star schema using SQL statements. Although this might not be the best example, it still demonstrates some of the core concepts of facts and dimensions. I will be performing the work in pgAdmin against my PostgreSQL database. The following are my outlined objectives:
- Creating the dimensional tables from the original columns
- Creating a fact table
Background
Original Crime Table
My original data was compiled from each of the years available on the OpenData DC website. Using R, I downloaded each GeoJSON file and then loaded them into the PostgreSQL database I created for my Criminal Analysis project.
In all likelihood, the downloaded data exists in its original data warehouse as a star schema or one of its family relatives (e.g. snowflake, constellation).
For additional background and the operations performed thus far, refer to the following posts:
Example Data
To simplify my data, I reduced the crime table to just the spatial grouping fields (Ward, District, PSA, etc.), seven fields total. This should provide a sufficient amount of data and variety to explain and demonstrate the concepts. If you have followed the previous posts and have the crime table in your database, you can simply run the following SQL statement to create a new table called dc_spatial_ref:
-- Create dc_spatial_ref table.
SELECT DISTINCT
"WARD" as ward,
"PSA" as psa,
"DISTRICT" as district,
"ANC" as anc,
"VOTING_PRECINCT" as voting_precinct,
"NEIGHBORHOOD_CLUSTER" as neighborhood_cluster,
"CENSUS_TRACT" as census_tract
INTO TABLE dc_spatial_ref
FROM crime
ORDER BY "WARD";
What is Dimensional Modeling?
Dimensional modeling is a methodology for designing data warehouse structures using the concepts of facts and dimensions to describe the data. Facts take the form of numeric values that serve as keys to their definitions, which are held in dimensions. The facts for each subject area of the data model are collected into a fact table, and each dimension is defined in its own table. For example, a fact row might store a ward key of 3, while the ward dimension table defines what that key refers to.
If I were creating the model from scratch, I would use the business model to define and map out the logical and/or physical model. The information would be defined by business needs and by what is currently being collected throughout the business.
For the purpose of this example, I’ll demonstrate creating a star schema from the data I have using SQL statements. The data I am using is the end result of queries and joins that generated the tables of data for analysis. I’m reverse engineering the data to demonstrate the underlying structure.
What is a Star Schema?
The star schema is a denormalized design consisting of a fact table surrounded by dimension tables. The arrangement of dimension tables around the fact table, depicted in an entity-relationship diagram, gives the appearance of a star.
The goal of the star schema is a simple structure focused on high-performing data retrieval. It is used by many business intelligence tools and applications.
The schema design is easy to understand because it is less abstract than what you would see in Third Normal Form; it focuses on the needs of business intelligence users and analysts. By their very nature, star schemas are not normalized.
The typical modeling process has four steps for creating the denormalized schema:
- Choose Business Objective
- Identify Granularity
- Identify Dimension and its Attributes
- Identify Fact
Derivation Example Note
Because of the nature of this derivation, I have not gone through these steps explicitly, but you can see that I have performed most of them in reverse order.
Creating the Look-up Tables
First, to create the dimension tables (look-up tables), I need to make each attribute in my newly created dc_spatial_ref table into its own unique table. Below you can see the SQL statements I use to perform the task across my entire table.
SQL Statements
The following template is what I used to create an index table for a given column and then insert the data from my original table into it.
First, I drop the table if it exists. I only have this here because I was experimenting. Once a table is defined, redefining it is generally not recommended; doing so could have serious cascading effects on all other tables that reference the original primary key. For this post, I wanted to make sure I tested everything before committing, and to avoid providing bad information.
-- Template for creating lookup tables per column
DROP TABLE IF EXISTS [table_name]_index;
Next I create the table, calling it “[table_name]_index”. For a simple example, let’s define [table_name]=ward. The original column name in my “crime” table is “WARD”. I am creating ward_index as my table with two columns, ward_id and ward. The ward_id will serve as the primary key for this table. If later there is an additional ward, I can insert the new value into my table and the ward_id will increment by 1. Similarly, if there is information about the dimension that I can add (extend), then I can alter the table without affecting the fact table. The modification expands what I can query.
-- CREATE TABLE Template for each column that is going to become a lookup table.
CREATE TABLE [table_name]_index (
[column_name]_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
[column_name] VARCHAR --(or INT)
);
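As a side note, extending a dimension later only requires an ALTER TABLE on the look-up table. For example, using a hypothetical ward_name attribute that is not part of this dataset:
-- Hypothetical extension: add a descriptive attribute to the ward dimension.
-- The fact table is unaffected; ward_name becomes reachable through ward_id.
ALTER TABLE ward_index ADD COLUMN ward_name VARCHAR;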
Finally, I populate the new table with the distinct values from dc_spatial_ref. Since dc_spatial_ref already uses the new lowercase column names, the quoted original names from the crime table are not needed here.
-- Insert the data into the new table from a Select query.
INSERT INTO [table_name]_index ([column_name])
SELECT DISTINCT [column_name]
FROM dc_spatial_ref
ORDER BY [column_name];
Putting it all together
Once I had my template structure, I threw the old column names, the new column names, and the template formula into a spreadsheet to quickly write all my SQL statements. The following image shows what it looked like:
I used the following formula in LibreOffice Calc (it works the same way in MS Excel) to generalize the process.
MS Excel / LibreOffice Calc
="DROP TABLE IF EXISTS "&B2&"_index; CREATE TABLE "&B2&"_index ("&B2&"_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, "&B2&" VARCHAR); INSERT INTO "&B2&"_index ("&B2&") SELECT DISTINCT "&B2&" FROM dc_spatial_ref ORDER BY "&B2&";"
You can use a similar template to create the remaining SQL statements as needed.
Execute the final SQL statement
The template allows me to copy and paste the final evaluated text into the Query Tool in pgAdmin.
Output
Now, all my tables with distinct values and unique identifiers have been created. I can spot check as needed after refreshing my table list. In this instance I’m showing you my ward_index table as validation.
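The same spot check can be done from the Query Tool instead of the table browser (a minimal sketch; your ids and values will differ):
-- Spot check: list the distinct ward values and their generated ids.
SELECT ward_id, ward
FROM ward_index
ORDER BY ward_id;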
Creating a Fact Table
Now that I have created all the dimension tables (think look-up tables) for each of the spatial grouping features, I will create a fact table using the primary keys from my dimension tables as foreign keys for each attribute (column) in the fact table. I will also assign a primary key that can then be referenced in an alternative version of my crime table.
Part 1
SQL Statements
First, I’ll construct the query so I can see the original and the referenced index attributes side by side. I like to visually validate before I execute a query that creates or modifies anything; it provides one last check before committing to something.
SELECT
dc_spatial_ref.*,
ward_index.ward_id,
psa_index.psa_id,
district_index.district_id,
anc_index.anc_id,
voting_precinct_index.voting_precinct_id,
neighborhood_cluster_index.neighborhood_cluster_id,
census_tract_index.census_tract_id
FROM dc_spatial_ref
LEFT JOIN ward_index ON dc_spatial_ref.ward = ward_index.ward
LEFT JOIN psa_index ON dc_spatial_ref.psa = psa_index.psa
LEFT JOIN district_index ON dc_spatial_ref.district = district_index.district
LEFT JOIN anc_index ON dc_spatial_ref.anc = anc_index.anc
LEFT JOIN voting_precinct_index ON dc_spatial_ref.voting_precinct = voting_precinct_index.voting_precinct
LEFT JOIN neighborhood_cluster_index ON dc_spatial_ref.neighborhood_cluster = neighborhood_cluster_index.neighborhood_cluster
LEFT JOIN census_tract_index ON dc_spatial_ref.census_tract = census_tract_index.census_tract;
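Before creating anything, it is also worth confirming that none of the LEFT JOINs came up empty. One quick way (a sketch reusing the same joins; NULLs in the source columns will surface here as well) is to count rows where any id failed to resolve:
-- Sanity check: count source rows where any dimension key is unresolved.
SELECT COUNT(*) AS unmatched_rows
FROM dc_spatial_ref
LEFT JOIN ward_index ON dc_spatial_ref.ward = ward_index.ward
LEFT JOIN psa_index ON dc_spatial_ref.psa = psa_index.psa
LEFT JOIN district_index ON dc_spatial_ref.district = district_index.district
LEFT JOIN anc_index ON dc_spatial_ref.anc = anc_index.anc
LEFT JOIN voting_precinct_index ON dc_spatial_ref.voting_precinct = voting_precinct_index.voting_precinct
LEFT JOIN neighborhood_cluster_index ON dc_spatial_ref.neighborhood_cluster = neighborhood_cluster_index.neighborhood_cluster
LEFT JOIN census_tract_index ON dc_spatial_ref.census_tract = census_tract_index.census_tract
WHERE ward_index.ward_id IS NULL
OR psa_index.psa_id IS NULL
OR district_index.district_id IS NULL
OR anc_index.anc_id IS NULL
OR voting_precinct_index.voting_precinct_id IS NULL
OR neighborhood_cluster_index.neighborhood_cluster_id IS NULL
OR census_tract_index.census_tract_id IS NULL;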
The final SQL statement selects the index fields into a table called dc_spatial_fact_table:
SELECT
ward_index.ward_id,
psa_index.psa_id,
district_index.district_id,
anc_index.anc_id,
voting_precinct_index.voting_precinct_id,
neighborhood_cluster_index.neighborhood_cluster_id,
census_tract_index.census_tract_id
INTO TABLE dc_spatial_fact_table
FROM dc_spatial_ref
LEFT JOIN ward_index ON dc_spatial_ref.ward = ward_index.ward
LEFT JOIN psa_index ON dc_spatial_ref.psa = psa_index.psa
LEFT JOIN district_index ON dc_spatial_ref.district = district_index.district
LEFT JOIN anc_index ON dc_spatial_ref.anc = anc_index.anc
LEFT JOIN voting_precinct_index ON dc_spatial_ref.voting_precinct = voting_precinct_index.voting_precinct
LEFT JOIN neighborhood_cluster_index ON dc_spatial_ref.neighborhood_cluster = neighborhood_cluster_index.neighborhood_cluster
LEFT JOIN census_tract_index ON dc_spatial_ref.census_tract = census_tract_index.census_tract;
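A quick row-count comparison (a minimal sketch) verifies that the fact table captured every row of dc_spatial_ref:
-- The fact table should hold exactly one row per dc_spatial_ref row.
SELECT
(SELECT COUNT(*) FROM dc_spatial_ref) AS source_rows,
(SELECT COUNT(*) FROM dc_spatial_fact_table) AS fact_rows;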
Part 2
Alter Table
Next, I need to alter the newly created table to include the id column, which will be the primary key. Then I’ll alter the remaining columns to become foreign keys.
ALTER TABLE dc_spatial_fact_table ADD id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY;
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (ward_id) REFERENCES ward_index(ward_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (psa_id) REFERENCES psa_index(psa_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (district_id) REFERENCES district_index(district_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (anc_id) REFERENCES anc_index(anc_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (voting_precinct_id) REFERENCES voting_precinct_index(voting_precinct_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (neighborhood_cluster_id) REFERENCES neighborhood_cluster_index(neighborhood_cluster_id);
ALTER TABLE dc_spatial_fact_table ADD FOREIGN KEY (census_tract_id) REFERENCES census_tract_index(census_tract_id);
Below you can see my input and output from the SQL statement in pgAdmin.
Output
Here is the resultant fact table.
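The payoff is that readable values can be reassembled on demand by joining the fact table back out to its dimensions. Here is a sketch using two of the seven dimensions; the remaining joins follow the same pattern:
-- Rebuild readable values from the fact table via its dimension tables.
SELECT
f.id,
w.ward,
p.psa
FROM dc_spatial_fact_table f
LEFT JOIN ward_index w ON f.ward_id = w.ward_id
LEFT JOIN psa_index p ON f.psa_id = p.psa_id
ORDER BY f.id;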
What’s Next?
After going through the example above, I could apply the concept to my full criminal activity analysis project data. The resultant entity-relationship diagram could look something like this:
Each reference table would have all the necessary attributes with a primary key. The crime fact table would then have its own unique primary key, with each dimension referenced per attribute using their primary keys as foreign keys. The full model might resemble a snowflake schema or a constellation schema when it’s all built out, though; the core of each of those is the star schema base.
Using this project model, the dc_spatial_ref table that this post initially created would be the dimension table with an appended primary key id attribute.
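As a sketch of where that could go, the future crime fact table might be declared along these lines. Everything beyond the spatial keys is a hypothetical stand-in for the remaining attributes of the full crime table:
-- Sketch of a possible crime fact table (offense_id and report_date are hypothetical).
CREATE TABLE crime_fact (
crime_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
ward_id INT REFERENCES ward_index(ward_id),
psa_id INT REFERENCES psa_index(psa_id),
district_id INT REFERENCES district_index(district_id),
anc_id INT REFERENCES anc_index(anc_id),
voting_precinct_id INT REFERENCES voting_precinct_index(voting_precinct_id),
neighborhood_cluster_id INT REFERENCES neighborhood_cluster_index(neighborhood_cluster_id),
census_tract_id INT REFERENCES census_tract_index(census_tract_id),
offense_id INT, -- hypothetical: would reference an offense_index dimension
report_date TIMESTAMP -- hypothetical: could reference a date dimension instead
);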
SQL and Database Resources
If you have not learned much about SQL and databases, I would highly recommend setting up an account at HackerRank.com. It is free and provides good problems to solve, which builds experience. Work through their SQL and Database problems. SQL is a very valuable skill and one Data Scientists are often expected to have. If you are interested in database problems, they have those too. The image below shows all the skills available to learn on their platform.
For referencing or looking up commands, syntax, and examples, use W3Schools.com and tutorialspoint.com.
Database Resources
For official information and documentation on PostgreSQL check out their website and documentation at https://www.postgresql.org/docs/13/index.html.
For the official pgAdmin documentation, please refer to this: https://www.pgadmin.org/docs/pgadmin4/4.28/index.html
There are all sorts of online resources and lots of books out there to help educate you on more database skills.