Saturday, July 20, 2013

Database - Data Warehouse - Staging Table Design

Staging Table

In a data warehousing or reporting-driven OLTP system, it is very common to use a staging table to store temporary data before transferring/cleansing/transforming the data into a permanent table.
One of the reasons that motivates us to use a staging table is to make sure a clean copy of the data, in its latest state, has been processed successfully before pushing it to the final table. In other words, instead of locking the live table for as long as the transformation process takes, we do all the heavy processing behind the scenes. At the end of the process, we simply swap the tables. This reduces user interruption from hours to just milliseconds.
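
As a quick sketch of the swap step (the table names dbo.FactSales and dbo.FactSales_staging are made up for illustration; your naming will differ), the rename itself is nearly instantaneous:

BEGIN TRANSACTION
    EXEC sp_rename 'dbo.FactSales', 'FactSales_old'          -- move the live table aside
    EXEC sp_rename 'dbo.FactSales_staging', 'FactSales'      -- promote the freshly processed copy
COMMIT TRANSACTION
-- dbo.FactSales_old can be dropped once we are happy with the new data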

The Design

Imagine we have a flat file with 10 columns. We will create a staging table with the exact same 10 columns and corresponding data types, and we will leave all columns as nullable.
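
For instance, a staging table for a hypothetical 10-column sales flat file could look like the sketch below (the names and types are made up for illustration). Note there is no primary key, no constraint, and every column is nullable, so the bulk load itself never rejects a row:

CREATE TABLE dbo.stg_SalesFile (
    OrderNumber   varchar(50)   NULL,
    OrderDate     date          NULL,
    CustomerCode  varchar(20)   NULL,
    ProductCode   varchar(20)   NULL,
    Quantity      int           NULL,
    UnitPrice     decimal(18,4) NULL,
    Currency      char(3)       NULL,
    Discount      decimal(5,2)  NULL,
    SalesRegion   varchar(50)   NULL,
    Remarks       varchar(200)  NULL
)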

Why?
In every RDBMS there is a bulk loading utility to insert data into a table, and this utility almost always performs much faster than any other DML command. For example, bcp.exe is the bulk loading utility in MS SQL Server. To import 1 million records into a table, bcp.exe took 9.4 seconds on my machine, BULK INSERT took 13 seconds, and a plain INSERT statement was the slowest at 1 minute. bcp.exe also has the smallest I/O impact on the hard disk.
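
For reference, a typical bcp.exe call and its BULK INSERT equivalent look roughly like this (the server name, database name, file path, terminators and batch size are all placeholders; adjust them to your environment):

REM from the command prompt
bcp MyDW.dbo.stg_SalesFile in "C:\load\salesfile.csv" -S MyServer -T -c -t, -r\n -b 50000

-- from T-SQL
BULK INSERT dbo.stg_SalesFile
FROM 'C:\load\salesfile.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 50000, TABLOCK)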

Some people may comment that Java/C#/C++ can do a better job. However, I am not an expert in that area, so I have no comment on that. :)

Next, once we have imported the data into our staging table, we can create data validation scripts to identify bad data by batch. Well, in the database world everything is set-based; if we were to identify bad data row by row, we might as well do everything in Java/C#/C++. So now we log all data violations based on the rules the user has provided.
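
As a sketch of what such a set-based validation could look like (the rules, the error table dbo.stg_SalesFile_error, and the table/column names are all made up for illustration; the real rules come from the user):

-- rule 1: Quantity must be a positive number
INSERT INTO dbo.stg_SalesFile_error (RuleName, OrderNumber, BadValue)
SELECT 'Quantity must be > 0', OrderNumber, CONVERT(varchar(50), Quantity)
FROM dbo.stg_SalesFile
WHERE Quantity IS NULL OR Quantity <= 0

-- rule 2: CustomerCode must exist in the customer master
INSERT INTO dbo.stg_SalesFile_error (RuleName, OrderNumber, BadValue)
SELECT 'Unknown CustomerCode', s.OrderNumber, s.CustomerCode
FROM dbo.stg_SalesFile s
LEFT JOIN dbo.DimCustomer c ON c.CustomerCode = s.CustomerCode
WHERE c.CustomerCode IS NULL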

After removing the bad data (or unqualified records), we may perform the process below (a sketch of the first two steps follows the list):
1. create ID for new natural keys
2. map records to existing ID based on natural key
3. identify relationship between tables (find foreign key ID)
4. use SCD1, SCD2 or SCD3 method to insert data into final table(s)
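
A rough sketch of steps 1 and 2, assuming a DimCustomer dimension with an IDENTITY surrogate key (CustomerID), CustomerCode as the natural key, and a nullable CustomerID column added to the staging table (all names are illustrative):

-- step 1: create IDs for natural keys we have never seen before
INSERT INTO dbo.DimCustomer (CustomerCode)
SELECT DISTINCT s.CustomerCode
FROM dbo.stg_SalesFile s
LEFT JOIN dbo.DimCustomer c ON c.CustomerCode = s.CustomerCode
WHERE c.CustomerCode IS NULL AND s.CustomerCode IS NOT NULL

-- step 2: map every staging row to its existing surrogate ID
UPDATE s
SET s.CustomerID = c.CustomerID
FROM dbo.stg_SalesFile s
JOIN dbo.DimCustomer c ON c.CustomerCode = s.CustomerCode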

Sometimes, due to the complexity of the source data, we may need to create a second or third level of staging. This can be very helpful for performing step-by-step verification and increasing data quality in the final table.

I hope you enjoy this article. Please feel free to comment. :)

Tuesday, March 5, 2013

Database - SQL Server - Convert Delimited String into Rows

Scenario:
A developer comes to us with a problem statement: how do we convert a delimited string variable into rows using T-SQL?
For example:
DECLARE @str VARCHAR(200)='aaa,bbb,ccc,ddd'

Expected output:
Result
aaa
bbb
ccc
ddd


Okay, I managed to find 3 methods on the Internet. :D

Method 1
-- master.dbo.spt_values is used here as a ready-made number (tally) table
SELECT SUBSTRING(@str, CHARINDEX(',', ','+@str+',', number),
       CHARINDEX(',', ','+@str+',', number+1) - CHARINDEX(',', ','+@str+',', number) - 1)
FROM master.dbo.spt_values
WHERE name IS NULL AND number <= LEN(@str)
AND number = CHARINDEX(',', ','+@str+',', number)

Method 2
DECLARE @xml xml
SET @xml = N'<root><r>' + REPLACE(@str, ',', '</r><r>') + '</r></root>'
SELECT x.value('.', 'NVARCHAR(255)')
FROM @xml.nodes('//root/r') AS a(x)

Method 3
SET @str='SELECT '''+REPLACE(@str,',',''' UNION ALL SELECT ''')+''''
EXEC(@str)

Somehow, I find that method 3 is the fastest on my machine.
Some benchmark data:
Method      Avg Speed (ms)
Method 1    143
Method 2     33
Method 3      3


Do you get the same results on your machine?
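
One rough way to check on your own machine is to run each method in a loop and average the elapsed time; a sketch (note that @str is re-initialised inside the loop, because method 3 overwrites it):

DECLARE @i int = 0, @t0 datetime2 = SYSDATETIME()
WHILE @i < 100
BEGIN
    DECLARE @str varchar(200) = 'aaa,bbb,ccc,ddd'   -- re-initialise each round
    -- paste method 1, 2 or 3 here
    SET @i += 1
END
SELECT DATEDIFF(millisecond, @t0, SYSDATETIME()) / 100.0 AS [avg ms per run]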

Monday, March 4, 2013

Database - SQL Server - Comparing N Column(s)

Background
Recently, I ran into a scenario where my web developer wanted to allow his users to download data from a 201-column table into a CSV file. The users then modify (add, update, or remove) records in the CSV and upload the data back into the system. For each batch, the system needs to compare about 100K rows between the downloaded and the uploaded data.
In addition, the primary key column is not downloaded; users only see 200 columns in the CSV.

Let's say my table looks like this:
CREATE TABLE myTable (
id INT IDENTITY(1,1) PRIMARY KEY,
col1 varchar(100) NULL,
col2 varchar(100) NULL,
col3 varchar(100) NULL,
.
.
.
col200 varchar(100) NULL
)

What is the fastest way to identify the changes?

Solutions
I thought of 3 possible methods. You may well have more solutions; please feel free to share them with me.
The first step is to upload the file via BCP or BULK INSERT into a temporary table called #upload.
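
A minimal sketch of that first step, using the same "..." shorthand for the column list (the file path is a placeholder, and FIRSTROW = 2 assumes the CSV has a header row):

-- #upload gets the same 200 data columns as myTable, but not the id column
SELECT col1,col2,col3,...,col200
INTO #upload
FROM myTable
WHERE 1 = 0          -- copy the structure only, no rows

BULK INSERT #upload
FROM 'C:\load\upload.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK)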

Method 1
Use pure FULL OUTER JOIN

SELECT t1.col1, t2.col1,t1.col2,t2.col2,t1.col3,t2.col3,.....,t1.col200,t2.col200
FROM myTable t1 FULL OUTER JOIN #upload t2 
ON t1.col1=t2.col1 AND t1.col2=t2.col2 AND t1.col3=t2.col3 ... AND t1.col200=t2.col200
WHERE t1.col1 IS NULL OR t2.col1 IS NULL

Method 2
Use EXCEPT

SELECT col1,col2,col3,...,col200 FROM myTable
EXCEPT
SELECT col1,col2,col3,...,col200 FROM #upload


SELECT col1,col2,col3,...,col200 FROM #upload
EXCEPT
SELECT col1,col2,col3,...,col200 FROM myTable


By combining the results of the two EXCEPT statements above, we are able to get the delta records. Easy and clean.
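
If you prefer the two directions in a single result set, a small sketch like this tags each delta row with where it came from (same "..." shorthand for the column list):

SELECT 'only in myTable' AS [source], col1,col2,col3,...,col200 FROM
( SELECT col1,col2,col3,...,col200 FROM myTable
  EXCEPT
  SELECT col1,col2,col3,...,col200 FROM #upload ) AS d1
UNION ALL
SELECT 'only in #upload' AS [source], col1,col2,col3,...,col200 FROM
( SELECT col1,col2,col3,...,col200 FROM #upload
  EXCEPT
  SELECT col1,col2,col3,...,col200 FROM myTable ) AS d2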

Method 3
Use additional MD5 computed column + FULL OUTER JOIN

In order to use this method, we need to create a persisted or non-persisted computed column on both myTable and #upload. The syntax should look like this (you can change the delimiter if you want; each column is wrapped in ISNULL so that a single NULL value does not turn the whole hash into NULL):

ALTER TABLE myTable ADD md5 AS HASHBYTES('MD5', ISNULL(col1,'')+'||'+ISNULL(col2,'')+'||'+ISNULL(col3,'')+...+'||'+ISNULL(col200,''))
ALTER TABLE #upload ADD md5 AS HASHBYTES('MD5', ISNULL(col1,'')+'||'+ISNULL(col2,'')+'||'+ISNULL(col3,'')+...+'||'+ISNULL(col200,''))

If it is a persisted column, we can also create an index on the md5 column. Either way, our join becomes:

SELECT t1.md5,t2.md5 FROM myTable t1 FULL OUTER JOIN #upload t2 ON t1.md5=t2.md5
WHERE t1.md5 IS NULL OR t2.md5 IS NULL
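
For completeness, a sketch of the persisted variant with an index on the hash (bear in mind that before SQL Server 2016, HASHBYTES accepts at most 8000 bytes of input, so extremely wide rows may need special handling):

-- persisted variant, so the hash is stored and can be indexed
ALTER TABLE myTable ADD md5 AS
    HASHBYTES('MD5', ISNULL(col1,'')+'||'+ISNULL(col2,'')+'||'+...+'||'+ISNULL(col200,'')) PERSISTED
CREATE INDEX IX_myTable_md5 ON myTable (md5)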

Assumptions
1. The user is not allowed to delete any column in the CSV file.
2. The number of columns must be greater than a "certain" number. I didn't have time to find that threshold, but if there are only 3 to 5 columns, it is not worth the effort to use method 3. :)


Result
Based on my own benchmark (your machine may give different results), method 3 runs much faster than method 2, and method 1 is the slowest. Also, for method 1 it is not possible to create an index on those 200 columns.
I have tried method 3 on a 23-column table and on a 201-column table. Both worked perfectly fast for me.

Please feel free to post a comment if you find any of these methods works for you. Cheers! ;)