
Saturday, July 20, 2013

Database - Data Warehouse - Staging Table Design

Staging Table

In data warehousing or a reporting-driven OLTP system, it is very common to use a staging table to store temporary data before transferring/cleansing/transforming it into a permanent table.
One of the reasons that motivates us to use a staging table is to make sure a clean copy of the data, in its latest state, has been processed successfully before pushing it to the final table. In other words, instead of locking the live table for as long as the transformation takes, we do all the heavy processing behind the scenes. At the end of the process, we just swap the tables. This reduces user interruption from hours to milliseconds or microseconds.
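For example, a minimal sketch of the swap step (table names are hypothetical; SQL Server syntax assumed):

-- All heavy transformation has already happened in dbo.Customer_Staging;
-- the live table is only unavailable for the duration of the two renames.
BEGIN TRANSACTION;
EXECUTE sp_rename 'dbo.Customer', 'Customer_Old';          -- retire the live table
EXECUTE sp_rename 'dbo.Customer_Staging', 'Customer';      -- promote the staged copy
COMMIT;
-- dbo.Customer_Old can be dropped once the new data has been verified.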

The Design

Imagine we have a flat file with 10 columns. We will create a staging table with the exact same 10 columns and corresponding data types. And, we will leave all columns as nullable.
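A minimal sketch of such a staging table, assuming a hypothetical customer flat file (only a few of the 10 columns shown):

-- Every column is nullable so the bulk load never fails on missing values;
-- validation happens later in set-based scripts.
CREATE TABLE dbo.Customer_Staging(
CustomerCode VARCHAR(20)   NULL,
CustomerName NVARCHAR(200) NULL,
DateOfBirth  VARCHAR(30)   NULL, -- keep as text; convert only after validation
CreditLimit  VARCHAR(30)   NULL
)
GO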

Why?
Every RDBMS has a bulk loading utility to insert data into a table, and this utility almost always performs much faster than any other DML command. For example, bcp.exe is the bulk loading utility in MS SQL Server. To import 1 million records into a table, bcp.exe took 9.4 seconds on my machine, BULK INSERT took 13 seconds, and a plain INSERT statement was the slowest at 1 minute. bcp.exe also has minimal I/O impact on the hard disk.
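For illustration, hedged sketches of both options against the hypothetical staging table above (file path, delimiter and server name are assumptions):

REM bcp.exe from the command prompt: character mode, comma-delimited, trusted connection
bcp MyDW.dbo.Customer_Staging in "C:\feeds\customer.csv" -c -t"," -S MyServer -T

-- BULK INSERT from within T-SQL
BULK INSERT dbo.Customer_Staging
FROM 'C:\feeds\customer.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)
GO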

Some people may comment that Java/C#/C++ can do a better job. However, I am not an expert in that area, so I have no comment on that. :)

Next, once we have imported data into our staging table, we can create data validation scripts to identify bad data by batch. In the database world, everything is set-based; if we were to identify bad data row-by-row, we might as well do everything in Java/C#/C++. So we log all data violations based on the rules that the user has provided.
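A minimal sketch of one such set-based rule (the violation table and rule text are hypothetical; TRY_CONVERT needs SQL Server 2012 or later):

-- Log every row whose DateOfBirth cannot be converted to a date, in a single set-based pass.
INSERT INTO dbo.Staging_Violation(CustomerCode, ruleName)
SELECT CustomerCode, 'Invalid DateOfBirth'
FROM dbo.Customer_Staging
WHERE DateOfBirth IS NOT NULL
AND TRY_CONVERT(DATE, DateOfBirth) IS NULL
GO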

After removing the bad data (or unqualified records), we may perform the process below (a sketch of steps 1, 2 and 4 follows the list):
1. create IDs for new natural keys
2. map records to existing IDs based on the natural key
3. identify relationships between tables (find foreign key IDs)
4. use the SCD1, SCD2 or SCD3 method to insert data into the final table(s)
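Here is a hedged sketch of steps 1, 2 and 4 using the SCD1 method (dimension and column names are hypothetical; dim.Customer is assumed to have an IDENTITY surrogate key, so new IDs are generated automatically on insert):

MERGE dim.Customer AS tgt
USING (SELECT CustomerCode, CustomerName FROM dbo.Customer_Staging) AS src
ON tgt.CustomerCode = src.CustomerCode                      -- natural key
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName
THEN UPDATE SET tgt.CustomerName = src.CustomerName         -- SCD1: overwrite in place
WHEN NOT MATCHED BY TARGET
THEN INSERT (CustomerCode, CustomerName) VALUES (src.CustomerCode, src.CustomerName);
GO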

Sometimes, due to the complexity of the source data, we may need to create a second or third level of staging. This can be very helpful for performing step-by-step verification and increasing data quality in the final table.

I hope you enjoy this article. Please feel free to comment. :)

Monday, October 8, 2012

Database - SQL Server - Logging

I would like to talk about stored procedure logging today. Many DBAs and database developers have a so-called "logging framework" in their applications. I would like to share my "logging framework" in this post. I hope you guys can provide some thoughts to make it more robust.

To me, logging and audit trail should be separated. So, I try not to store who ran a stored procedure and when in my application.

What do we log?
1. Execution time
    a. start and end execution time of a stored procedure
    b. start and end execution time of a step in a stored procedure
2. Error
    a. catch and store error information for troubleshooting later

Why do we log?
1. Enable us to identify slow running stored procedures
2. Give us some hints to take proactive actions before a stored procedure screws up our application/database/server
3. Resolve errors that occurred during run-time (improve application quality by removing possible flaws in the stored procedure)

When do we log?
IMO, we don't log all stored procedures. I would only log stored procedures that involve the operations below:
1. ETL stored procedure
2. Report stored procedure
3. Stored procedures that contain complex business logic
For those stored procedures that need to execute FAST and are required to support very high concurrency, I don't recommend adding logging. Every byte of disk I/O matters there, and every variable we declare in a stored procedure consumes memory. Well, we shouldn't dive into potential performance issues in this post.

My requirements of logging
1. Able to be turned on/off at stored procedure level
2. Able to store the log information in a table or output it on screen immediately
3. Able to maintain the link between execution step info and error message
4. Fast - only sequential INSERTs (perhaps minimum UPDATE is acceptable)
5. Log data is easy to analyze

Open for discussion
1. Do you prefer storing logging tables on separate data files?
2. Would you centralize all logging in a dedicated database/server?

Any comment is most welcome. :)

Code:
IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name='ERROR')

EXECUTE ('CREATE SCHEMA ERROR')
GO
IF OBJECT_ID('ERROR.error') IS NOT NULL
DROP TABLE ERROR.error
GO
CREATE TABLE ERROR.error(
ErrorDateTime  SMALLDATETIME  NOT NULL CONSTRAINT DF_ErrorDateTime DEFAULT GETUTCDATE(),
ErrorNumber    INT NOT NULL,
ErrorSeverity  INT NOT NULL,
ErrorState     INT NOT NULL,
ErrorLine      INT NOT NULL,
traceSummaryID INT NOT NULL,
ErrorProcedure NVARCHAR(126)  NOT NULL CONSTRAINT DF_ErrorProcedure DEFAULT '',
ErrorMessage   NVARCHAR(2048) NOT NULL CONSTRAINT DF_ErrorMessage DEFAULT ''
)
GO
IF OBJECT_ID('ERROR.INS_error') IS NOT NULL
DROP PROCEDURE ERROR.INS_error
GO
CREATE PROCEDURE ERROR.INS_error
@iTraceSummaryID INT,
@iPrintOnScreen  BIT=0 --developer wants to see the output on screen only
AS
BEGIN
SET NOCOUNT ON;
IF @iPrintOnScreen=1
   PRINT 'Err '+CONVERT(VARCHAR,ERROR_NUMBER())+',Severity '+CONVERT(VARCHAR,ERROR_SEVERITY())+', State '+CONVERT(VARCHAR,ERROR_STATE())+', Line '+CONVERT(VARCHAR,ERROR_LINE())+' - '+ERROR_MESSAGE()
ELSE
INSERT INTO ERROR.error(ErrorNumber,ErrorSeverity,ErrorState,ErrorProcedure,ErrorLine,ErrorMessage,traceSummaryID)
SELECT 
ISNULL(ERROR_NUMBER(),0) AS ErrorNumber
,ISNULL(ERROR_SEVERITY(),0) AS ErrorSeverity
,ISNULL(ERROR_STATE(),0) AS ErrorState
,ISNULL(ERROR_PROCEDURE(),'') AS ErrorProcedure
,ISNULL(ERROR_LINE(),0) AS ErrorLine
,ISNULL(ERROR_MESSAGE(),'') AS ErrorMessage
,ISNULL(@iTraceSummaryID,0) AS traceSummaryID;
END
GO
IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name='CONFIG')
EXECUTE ('CREATE SCHEMA CONFIG')
GO
IF OBJECT_ID('CONFIG.procReturnCode') IS NOT NULL
DROP TABLE CONFIG.procReturnCode
GO
-- List of Return Codes that must be same across all environments
CREATE TABLE CONFIG.procReturnCode(
procReturnCodeID TINYINT NOT NULL,
name             NVARCHAR(120) NOT NULL
)
GO
INSERT INTO CONFIG.procReturnCode(procReturnCodeID,name)
SELECT 0,'OK' UNION ALL
SELECT 10,'Error-Schema' UNION ALL
SELECT 30,'Error-Business Logic'   UNION ALL
SELECT 100,'Error-Data Validation' UNION ALL
SELECT 240,'Error-Others';
GO
IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name='LOGGING')
EXECUTE ('CREATE SCHEMA LOGGING')
GO
IF OBJECT_ID('LOGGING.traceSummary') IS NOT NULL
DROP TABLE LOGGING.traceSummary
GO
CREATE TABLE LOGGING.traceSummary(
traceSummaryID INT IDENTITY(-2147483648,1) CONSTRAINT PK_traceSummary PRIMARY KEY,
procName       NVARCHAR(261) NOT NULL
)
GO
IF OBJECT_ID('LOGGING.traceDetail') IS NOT NULL
DROP TABLE LOGGING.traceDetail
GO
CREATE TABLE LOGGING.traceDetail(
traceSummaryID INT NOT NULL,
step           TINYINT  NOT NULL CONSTRAINT DF_traceDetail_step DEFAULT 0,
traceDateTime  DATETIME NOT NULL CONSTRAINT DF_traceDetail_traceDateTime DEFAULT GETUTCDATE(),
stepName       NVARCHAR(128) NOT NULL CONSTRAINT DF_traceDetail_stepName DEFAULT ''
)
GO
CREATE NONCLUSTERED INDEX IDX_LOGGING_traceDetail_1 ON LOGGING.traceDetail(traceSummaryID) INCLUDE(step)
GO
IF OBJECT_ID('LOGGING.INS_trace') IS NOT NULL
DROP PROCEDURE LOGGING.INS_trace
GO
CREATE PROCEDURE LOGGING.INS_trace
@iTraceSummaryID INT,
@iProcName       NVARCHAR(261),
@iStepName       NVARCHAR(128)='', --optional, for developer to mark a step only
@iPrintOnScreen  BIT=0             --developer wants to see the output on screen only
AS
BEGIN
SET NOCOUNT ON;
IF @iPrintOnScreen=1
BEGIN
PRINT  CONVERT(VARCHAR,GETUTCDATE(),121)+'-'+@iStepName
RETURN @iTraceSummaryID
END
IF @iTraceSummaryID=0
BEGIN
INSERT INTO LOGGING.traceSummary(procName) VALUES (@iProcName);
SET @iTraceSummaryID=SCOPE_IDENTITY();
END
INSERT INTO LOGGING.traceDetail(traceSummaryID,stepName,step)
SELECT @iTraceSummaryID,@iStepName
,(SELECT ISNULL(MAX(step),0)+1 FROM LOGGING.traceDetail WITH(NOLOCK) WHERE traceSummaryID=@iTraceSummaryID)

RETURN @iTraceSummaryID
END
GO
IF OBJECT_ID('dbo.template_Transaction') IS NOT NULL
DROP PROCEDURE dbo.template_Transaction
GO
CREATE PROCEDURE dbo.template_Transaction
@iParam INT,
@oParam INT OUTPUT
AS
BEGIN TRY
SET NOCOUNT ON;
-- ********************************************************************************************* //
-- Local Variables & Initialization
DECLARE @returnCode TINYINT=0
DECLARE @traceSummaryID INT=0
DECLARE @printOnScreen  BIT=0
DECLARE @proc NVARCHAR(261)=QUOTENAME(OBJECT_SCHEMA_NAME(@@PROCID))+'.'+QUOTENAME(OBJECT_NAME(@@PROCID))

-- Logging
EXECUTE @traceSummaryID=LOGGING.INS_trace @traceSummaryID,@proc,'Start',@printOnScreen
-- ********************************************************************************************* //
-- Temporary Tables
-- CREATE TABLE #t(ID INT)
-- DECLARE @t AS TABLE(ID INT)
EXECUTE @traceSummaryID=LOGGING.INS_trace @traceSummaryID,@proc,'Temporary Table',@printOnScreen
-- Business Logic goes here

-- NOTE: Avoid calling LOGGING.INS_trace in the TRANSACTION!
--       1. reduce LOCK on trace tables
--       2. when rolled-back, everything is gone
--       Keep transaction small
BEGIN TRANSACTION


END_OF_SP:
COMMIT
-- Logging
EXECUTE @traceSummaryID=LOGGING.INS_trace @traceSummaryID,@proc,'End',@printOnScreen
END TRY
BEGIN CATCH
IF XACT_STATE()= 1 COMMIT
IF XACT_STATE()=-1 ROLLBACK
-- execute everything else after COMMIT/ROLLBACK
EXECUTE ERROR.INS_error @traceSummaryID,@printOnScreen --establish a link between logging and error handling
RETURN  @returnCode -- refer CONFIG.procReturnCode
END CATCH
GO
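For completeness, a hedged sketch of how the template might be exercised and the trace inspected afterwards:

-- Run the template (parameter values are arbitrary) and read back the trace.
DECLARE @out INT
EXECUTE dbo.template_Transaction @iParam=1, @oParam=@out OUTPUT
GO
-- One traceSummary row per execution, one traceDetail row per step,
-- plus an ERROR.error row if the CATCH block fired.
SELECT s.procName, d.step, d.stepName, d.traceDateTime
FROM LOGGING.traceSummary AS s
JOIN LOGGING.traceDetail AS d ON d.traceSummaryID = s.traceSummaryID
ORDER BY s.traceSummaryID, d.step
GO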

Friday, July 27, 2012

Database: Unique Identifier or Identity

In most RDBMSs, DBAs and DB developers constantly face the question of whether to use a Unique Identifier (GUID) or an Identity (auto-increment, sequence, etc.).


To be frank, many experienced DBAs have known the answer for a long time. Perhaps some of them have already posted on this topic somewhere else on the Internet.


So, what do experts recommend? IDENTITY, if possible.


Why?
1. An IDENTITY column is always unique and sorted in the sequence that we want (ascending or descending)
2. With these characteristics, it saves a lot of resources when INSERTing data into the table
    i.   new data always goes to the bottom of the B-tree/heap
    ii.  the database engine does not need to adjust the B-tree when new data comes in
    iii. it is much easier to partition the data (well, it depends on your data partitioning strategy)
    iv.  if this column is the PK or clustered index, its size directly affects the storage requirement of your non-clustered indexes (for MSSQL, a Unique Identifier is 16 bytes while a BIGINT is 8 bytes); as you know, each non-clustered index contains your clustered index key (see the sketch after this list)
3. You hardly ever UPDATE an IDENTITY column (and it is not good practice to do so)
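To make the key-size point in 2.iv concrete, a minimal sketch of the two alternatives (table names are hypothetical):

-- UNIQUEIDENTIFIER clustered key: 16 bytes, repeated in every non-clustered index,
-- and NEWID() values land randomly in the B-tree (page splits, fragmentation).
CREATE TABLE dbo.OrderGuid(
OrderID   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
OrderDate DATETIME NOT NULL
)
GO
-- BIGINT IDENTITY clustered key: 8 bytes, and new rows always append at the end of the B-tree.
CREATE TABLE dbo.OrderIdentity(
OrderID   BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
OrderDate DATETIME NOT NULL
)
GO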


However, there are some challenges that we might face when using IDENTITY:
1. The limit of BIGINT is 9,223,372,036,854,775,807
    - in the BIG DATA era, it becomes possible to hit that limit
2. Data migration between databases is not straightforward
    - we have to put in extra effort to ensure 100% data integrity (see the sketch below)
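For point 2, one common workaround is to preserve the original surrogate keys during the copy; a hedged sketch (database, table and column names are hypothetical):

-- Copy rows between databases while keeping the original IDENTITY values.
SET IDENTITY_INSERT TargetDB.dbo.Customer ON
INSERT INTO TargetDB.dbo.Customer(CustomerID, CustomerCode, CustomerName)
SELECT CustomerID, CustomerCode, CustomerName
FROM SourceDB.dbo.Customer
SET IDENTITY_INSERT TargetDB.dbo.Customer OFF
GO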