Saturday, July 20, 2013

Database - Data Warehouse - Staging Table Design

Staging Table

In data warehousing, or in a reporting-driven OLTP system, it is very common to use a staging table to hold temporary data before transferring/cleansing/transforming it into a permanent table.
One of the reasons that motivates us to use a staging table is to make sure a clean copy of the data, in its latest state, has been processed successfully before pushing it to the final table. In other words, instead of locking the live table for as long as the transformation process takes, we do all the heavy processing behind the scenes. At the end of the process, we simply swap the tables. This reduces user interruption from hours to mere milliseconds.
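As a rough sketch, here is one common way to do the swap itself in MS SQL Server; Sales and SalesStaging are hypothetical table names. sp_rename only touches metadata, so the live table is unavailable for just a moment:

BEGIN TRANSACTION;
    -- Move the live table out of the way, then promote the fully
    -- processed staging table in its place.
    EXEC sp_rename 'dbo.Sales', 'Sales_Old';
    EXEC sp_rename 'dbo.SalesStaging', 'Sales';
COMMIT TRANSACTION;
-- dbo.Sales_Old can now be truncated and reused for the next load.

(For partitioned tables, ALTER TABLE ... SWITCH is another near-instant option.)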

The Design

Imagine we have a flat file with 10 columns. We will create a staging table with the exact same 10 columns and corresponding data types, and we will leave all columns nullable.
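As a minimal sketch, assuming a hypothetical customer extract (only a few of the 10 columns are shown, and all names are made up):

CREATE TABLE dbo.CustomerStaging
(
    CustomerCode VARCHAR(20)  NULL,  -- natural key from the source file
    CustomerName VARCHAR(100) NULL,
    SignUpDate   VARCHAR(30)  NULL   -- staged as text so a bad value
                                     -- survives the load and is caught
                                     -- later by the validation pass
);

Keeping every column nullable (and, here, keeping the date as text) means the bulk load never rejects a row; bad values are caught later by set-based validation instead of killing the import.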

Why?
Every RDBMS ships with a bulk-loading utility for inserting data into a table, and this utility almost always performs much faster than any other DML command. For example, bcp.exe is the bulk-loading utility in MS SQL Server. To import 1 million records into a table, bcp.exe took 9.4 seconds on my machine, BULK INSERT took 13 seconds, and a plain INSERT statement was the slowest at 1 minute. bcp.exe also had the lowest I/O impact on the hard disk.
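For reference, the two faster options could look like this; the server, database, table, and file names are all hypothetical. First the bcp command line (-T for a trusted connection, -c for character mode, -t, for a comma field terminator):

bcp MyDatabase.dbo.CustomerStaging in C:\data\customers.csv -S MyServer -T -c -t,

-- The BULK INSERT equivalent, run from inside SQL Server; TABLOCK
-- helps the engine use minimally logged inserts where possible.
BULK INSERT dbo.CustomerStaging
FROM 'C:\data\customers.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);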

Some people may comment that Java/C#/C++ can do a better job. However, I am not an expert in that area, so I have no comment on that. :)

Next, once we have imported the data into our staging table, we can create data validation scripts to identify bad data in batches. In the database world, everything is set-based; if we were to identify bad data row by row, we might as well do everything in Java/C#/C++. So we log all data violations based on the rules the user has provided.
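As a hedged sketch of one such rule, using the hypothetical staging table from above (TRY_CONVERT needs SQL Server 2012 or later):

-- Hypothetical log table for rule violations.
CREATE TABLE dbo.ValidationLog
(
    CustomerCode VARCHAR(20)  NULL,
    RuleViolated VARCHAR(200) NULL
);

-- Rule: SignUpDate must parse as a date. One set-based pass logs
-- every offender in the batch (NULLs are treated as violations too).
INSERT INTO dbo.ValidationLog (CustomerCode, RuleViolated)
SELECT CustomerCode, 'SignUpDate is not a valid date'
FROM dbo.CustomerStaging
WHERE TRY_CONVERT(DATE, SignUpDate) IS NULL;

-- Remove the offending rows before transformation.
DELETE FROM dbo.CustomerStaging
WHERE TRY_CONVERT(DATE, SignUpDate) IS NULL;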

After we have removed the bad data (or unqualified records), we may perform the process below (a sketch of steps 1, 2, and 4 follows the list):
1. create IDs for new natural keys
2. map records to existing IDs based on the natural key
3. identify relationships between tables (find foreign key IDs)
4. use the SCD1, SCD2, or SCD3 method to insert data into the final table(s)
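Here is a minimal sketch of steps 1, 2, and 4 using the SCD1 (overwrite) approach. DimCustomer, CustomerID, CustomerCode, and CustomerName are hypothetical names, and CustomerID is assumed to be an IDENTITY surrogate key:

-- Step 1: create surrogate IDs for natural keys we have not seen before
-- (assumes the validation pass already de-duplicated the staging data).
INSERT INTO dbo.DimCustomer (CustomerCode, CustomerName)
SELECT s.CustomerCode, s.CustomerName
FROM dbo.CustomerStaging AS s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.DimCustomer AS d
    WHERE d.CustomerCode = s.CustomerCode
);

-- Steps 2 and 4 (SCD1): match existing records on the natural key and
-- overwrite their attributes with the latest staged values.
UPDATE d
SET d.CustomerName = s.CustomerName
FROM dbo.DimCustomer AS d
JOIN dbo.CustomerStaging AS s
    ON s.CustomerCode = d.CustomerCode;

With SCD2, the UPDATE would instead expire the current dimension row and insert a new version row, so history is preserved rather than overwritten.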

Sometimes, due to the complexity of the source data, we may need to create a second or third level of staging. This can be very helpful for performing step-by-step verification and increasing data quality in the final table.

I hope you enjoy this article. Please feel free to comment. :)