Scenario:
A developer comes to us with a problem statement: how do we convert a delimited string variable into rows using T-SQL?
For example:
DECLARE @str VARCHAR(200)='aaa,bbb,ccc,ddd'
Expected output: four rows, one per value (aaa, bbb, ccc, ddd).
Okay, I managed to find three methods on the Internet. :D
Method 1
SELECT SUBSTRING(@str,
       CHARINDEX(',', ','+@str+',', number),
       CHARINDEX(',', ','+@str+',', number+1) - CHARINDEX(',', ','+@str+',', number) - 1)
FROM master.dbo.spt_values
WHERE name IS NULL
  AND number <= LEN(@str)
  AND number = CHARINDEX(',', ','+@str+',', number)
Method 2
DECLARE @xml XML
SET @xml = N'<root><r>' + REPLACE(@str, ',', '</r><r>') + '</r></root>'
SELECT x.value('.', 'NVARCHAR(255)')
FROM @xml.nodes('//root/r') AS a(x)
Method 3
SET @str='SELECT '''+REPLACE(@str,',',''' UNION ALL SELECT ''')+''''
EXEC(@str)
Somehow, I find that method 3 is the fastest on my machine.
Some benchmark data:
Method   | Avg Speed (ms)
---------|---------------
Method 1 | 143
Method 2 | 33
Method 3 | 3
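A quick note on measurement: the harness behind the numbers above is not shown here. Below is a minimal sketch of one way to time a run using SET STATISTICS TIME (Method 3 is used as the example; swap in Method 1 or 2 to compare). With only four values a single run finishes almost instantly, so in practice you would use a longer string or a loop to get numbers like those in the table.
DECLARE @str VARCHAR(200) = 'aaa,bbb,ccc,ddd'
SET STATISTICS TIME ON   -- elapsed and CPU time are printed in the Messages tab
-- Method 3: build and run the dynamic UNION ALL statement
DECLARE @sql VARCHAR(1000) = 'SELECT ''' + REPLACE(@str, ',', ''' UNION ALL SELECT ''') + ''''
EXEC(@sql)
SET STATISTICS TIME OFF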
Do you get the same result on your machine?
Background
Recently, I ran into a scenario where my web developer wanted to allow his users to download data from a 201-column table into a CSV file. The user would then modify (add, change, or remove) records in the CSV and upload the data back to the system. For each batch, the system needs to compare about 100K rows between the download and the upload.
In addition, the primary key column is not downloaded, so the user only sees 200 columns in the CSV.
Let's say my table looks like this:
CREATE TABLE myTable (
id INT IDENTITY(1,1) PRIMARY KEY,
col1 varchar(100) NULL,
col2 varchar(100) NULL,
col3 varchar(100) NULL,
.
.
.
col200 varchar(100) NULL
)
What is the fastest way to identify changes?
Solutions
I thought of 3 possible methods. Well, you may have more solutions; please feel free to share them with me.
In every case, the first step is to load the uploaded file via BCP or BULK INSERT into a temporary table called #upload.
Method 1
Use pure FULL OUTER JOIN
SELECT t1.col1, t2.col1,t1.col2,t2.col2,t1.col3,t2.col3,.....,t1.col200,t2.col200
FROM myTable t1 FULL OUTER JOIN #upload t2
ON t1.col1=t2.col1 AND t1.col2=t2.col2 AND t1.col3=t2.col3 ... AND t1.col200=t2.col200
WHERE t1.col1 IS NULL OR t2.col1 IS NULL
Method 2
Use EXCEPT
SELECT col1,col2,col3,...,col200 FROM myTable
EXCEPT
SELECT col1,col2,col3,...,col200 FROM #upload
SELECT col1,col2,col3,...,col200 FROM #upload
EXCEPT
SELECT col1,col2,col3,...,col200 FROM myTable
By combining the results of the two EXCEPT statements above, we get the delta records. Easy and clean.
Method 3
Use additional MD5 computed column + FULL OUTER JOIN
In order to use this method, we need to create a persisted or non-persisted computed column on both myTable and #upload. The syntax looks like this (you can change the delimiter if you want):
ALTER TABLE myTable ADD md5 AS HASHBYTES('MD5', col1+'||'+col2+'||'+col3+'||'+...+'||'+col200)
ALTER TABLE #upload ADD md5 AS HASHBYTES('MD5', col1+'||'+col2+'||'+col3+'||'+...+'||'+col200)
If it is a persisted column, we can also create an index on the md5 column. Either way, the join becomes:
SELECT t1.md5,t2.md5 FROM myTable t1 FULL OUTER JOIN #upload t2 ON t1.md5=t2.md5
WHERE t1.md5 IS NULL OR t2.md5 IS NULL
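To make Method 3 concrete, here is a minimal end-to-end sketch on a hypothetical 3-column table (the table and index names are illustrative). ISNULL() is wrapped around each column because concatenating a NULL column would otherwise turn the whole hash into NULL, and the hash is cast to BINARY(16) so the index key stays small.
CREATE TABLE demoTable (
  id   INT IDENTITY(1,1) PRIMARY KEY,
  col1 VARCHAR(100) NULL,
  col2 VARCHAR(100) NULL,
  col3 VARCHAR(100) NULL
)
-- HASHBYTES is deterministic, so the computed column can be PERSISTED and indexed
ALTER TABLE demoTable ADD md5 AS
  CAST(HASHBYTES('MD5', ISNULL(col1,'')+'||'+ISNULL(col2,'')+'||'+ISNULL(col3,'')) AS BINARY(16)) PERSISTED
CREATE NONCLUSTERED INDEX IDX_demoTable_md5 ON demoTable(md5)
-- #upload gets the same computed column after BULK INSERT, and the delta query stays as shown above:
-- SELECT t1.md5, t2.md5
-- FROM demoTable t1 FULL OUTER JOIN #upload t2 ON t1.md5 = t2.md5
-- WHERE t1.md5 IS NULL OR t2.md5 IS NULL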
Assumptions
1. User is not allowed to delete any column in the CSV file.
2. The number of columns must be greater than a certain threshold. I didn't have time to pin the threshold down, but if there are only 3 to 5 columns, it is not worth the effort to use Method 3. :)
Result
Based on my own benchmark (your machine may give a different result), Method 3 runs much faster than Method 2, and Method 1 is the slowest. Also, it is not possible to create an index across all 200 columns.
I have tried Method 3 on a 23-column table and a 201-column table. Both work fast for me.
Please feel free to post your comment if you find any of the method works for you. Cheers! ;)
I would like to talk about stored procedure logging today. Many DBAs and database developers have a so-called "logging framework" in their applications. I would like to share my "logging framework" in this post, and I hope you can offer some thoughts to make it more robust.
To me, logging and audit trails should be kept separate, so I try not to store who ran a stored procedure, or when, in my application.
What do we log?
1. Execution time
a. start and end execution time of a stored procedure
b. start and end execution time of a step in a stored procedure
2. Error
a. catch and store error information for troubleshooting later
Why do we log?
1. Enable us to identify slow running stored procedures
2. Give us some hints to take proactive actions before a stored procedure screws up our application/database/server
3. Resolve errors that occurred at run-time (improve application quality by removing possible flaws in the stored procedure)
When do we log?
IMO, we don't need to log every stored procedure. I would log only stored procedures that fall into the categories below:
1. ETL stored procedure
2. Report stored procedure
3. Stored procedures that contain complex business logic
For stored procedures that need to execute FAST and must support very high concurrency, I don't recommend adding logging. Every byte of disk IO matters there, and, for the same reason, every variable declared in the stored procedure consumes memory. Anyway, potential performance issues are beyond the scope of this post.
My requirements of logging
1. Able to be turned on/off at stored procedure level
2. Able to store the log information in table or output the log information on screen immediately
3. Able to maintain the link between execution step info and error message
4. Fast - only sequential INSERTs (perhaps a minimal amount of UPDATEs is acceptable)
5. Log data is easy to analyze
Open for discussion
1. Do you prefer storing logging tables on separate data files?
2. Would you centralize all logging in a dedicated database/server?
Any comment is most welcome. :)
Code:
IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name='ERROR')
EXECUTE ('CREATE SCHEMA ERROR')
GO
IF OBJECT_ID('ERROR.error') IS NOT NULL
DROP TABLE ERROR.error
GO
CREATE TABLE ERROR.error(
ErrorDateTime SMALLDATETIME NOT NULL CONSTRAINT DF_ErrorDateTime DEFAULT GETUTCDATE(),
ErrorNumber INT NOT NULL,
ErrorSeverity INT NOT NULL,
ErrorState INT NOT NULL,
ErrorLine INT NOT NULL,
traceSummaryID INT NOT NULL,
ErrorProcedure NVARCHAR(126) NOT NULL CONSTRAINT DF_ErrorProcedure DEFAULT '',
ErrorMessage NVARCHAR(2048) NOT NULL CONSTRAINT DF_ErrorMessage DEFAULT ''
)
GO
IF OBJECT_ID('ERROR.INS_error') IS NOT NULL
DROP PROCEDURE ERROR.INS_error
GO
CREATE PROCEDURE ERROR.INS_error
@iTraceSummaryID INT,
@iPrintOnScreen BIT=0 --developer wants to see the output on screen only
AS
BEGIN
SET NOCOUNT ON;
IF @iPrintOnScreen=1
PRINT 'Err '+CONVERT(VARCHAR,ERROR_NUMBER())+',Severity '+CONVERT(VARCHAR,ERROR_SEVERITY())+', State '+CONVERT(VARCHAR,ERROR_STATE())+', Line '+CONVERT(VARCHAR,ERROR_LINE())+' - '+ERROR_MESSAGE()
ELSE
INSERT INTO ERROR.error(ErrorNumber,ErrorSeverity,ErrorState,ErrorProcedure,ErrorLine,ErrorMessage,traceSummaryID)
SELECT
ISNULL(ERROR_NUMBER(),0) AS ErrorNumber
,ISNULL(ERROR_SEVERITY(),0) AS ErrorSeverity
,ISNULL(ERROR_STATE(),0) AS ErrorState
,ISNULL(ERROR_PROCEDURE(),'') AS ErrorProcedure
,ISNULL(ERROR_LINE(),0) AS ErrorLine
,ISNULL(ERROR_MESSAGE(),'') AS ErrorMessage
,ISNULL(@iTraceSummaryID,0) AS traceSummaryID;
END
GO
IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name='CONFIG')
EXECUTE ('CREATE SCHEMA CONFIG')
GO
IF OBJECT_ID('CONFIG.procReturnCode') IS NOT NULL
DROP TABLE CONFIG.procReturnCode
GO
-- List of Return Codes that must be same across all environments
CREATE TABLE CONFIG.procReturnCode(
procReturnCodeID TINYINT NOT NULL,
name NVARCHAR(120) NOT NULL
)
GO
INSERT INTO CONFIG.procReturnCode(procReturnCodeID,name)
SELECT 0,'OK' UNION ALL
SELECT 10,'Error-Schema' UNION ALL
SELECT 30,'Error-Business Logic' UNION ALL
SELECT 100,'Error-Data Validation' UNION ALL
SELECT 240,'Error-Others';
GO
IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name='LOGGING')
EXECUTE ('CREATE SCHEMA LOGGING')
GO
IF OBJECT_ID('LOGGING.traceSummary') IS NOT NULL
DROP TABLE LOGGING.traceSummary
GO
CREATE TABLE LOGGING.traceSummary(
traceSummaryID INT IDENTITY(-2147483648,1) CONSTRAINT PK_traceSummary PRIMARY KEY,
procName NVARCHAR(261) NOT NULL
)
GO
IF OBJECT_ID('LOGGING.traceDetail') IS NOT NULL
DROP TABLE LOGGING.traceDetail
GO
CREATE TABLE LOGGING.traceDetail(
traceSummaryID INT NOT NULL,
step TINYINT NOT NULL CONSTRAINT DF_traceDetail_step DEFAULT 0,
traceDateTime DATETIME NOT NULL CONSTRAINT DF_traceDetail_traceDateTime DEFAULT GETUTCDATE(),
stepName NVARCHAR(128) NOT NULL CONSTRAINT DF_traceDetail_stepName DEFAULT ''
)
GO
CREATE NONCLUSTERED INDEX IDX_LOGGING_traceDetail_1 ON LOGGING.traceDetail(traceSummaryID) INCLUDE(step)
GO
IF OBJECT_ID('LOGGING.INS_trace') IS NOT NULL
DROP PROCEDURE LOGGING.INS_trace
GO
CREATE PROCEDURE LOGGING.INS_trace
@iTraceSummaryID INT,
@iProcName NVARCHAR(261),
@iStepName NVARCHAR(128)='', --optional, for developer to mark a step only
@iPrintOnScreen BIT=0 --developer wants to see the output on screen only
AS
BEGIN
IF @iPrintOnScreen=1
BEGIN
PRINT CONVERT(VARCHAR,GETUTCDATE(),121)+'-'+@iStepName
RETURN @iTraceSummaryID
END
IF @iTraceSummaryID=0
BEGIN
INSERT INTO LOGGING.traceSummary(procName) VALUES (@iProcName);
SET @iTraceSummaryID=SCOPE_IDENTITY();
END
INSERT INTO LOGGING.traceDetail(traceSummaryID,stepName,step)
SELECT @iTraceSummaryID,@iStepName
,(SELECT ISNULL(MAX(step),0)+1 FROM LOGGING.traceDetail WITH(NOLOCK) WHERE traceSummaryID=@iTraceSummaryID)
RETURN @iTraceSummaryID
END
GO
IF OBJECT_ID('dbo.template_Transaction') IS NOT NULL
DROP PROCEDURE dbo.template_Transaction
GO
CREATE PROCEDURE dbo.template_Transaction
@iParam INT,
@oParam INT OUTPUT
AS
BEGIN TRY
SET NOCOUNT ON;
-- ********************************************************************************************* //
-- Local Variables & Initialization
DECLARE @returnCode TINYINT=0
DECLARE @traceSummaryID INT=0
DECLARE @printOnScreen BIT=0
DECLARE @proc NVARCHAR(261)=QUOTENAME(OBJECT_SCHEMA_NAME(@@PROCID))+'.'+QUOTENAME(OBJECT_NAME(@@PROCID))
-- Logging
EXECUTE @traceSummaryID=LOGGING.INS_trace @traceSummaryID,@proc,'Start',@printOnScreen
-- ********************************************************************************************* //
-- Temporary Tables
-- CREATE TABLE #t(ID INT)
-- DECLARE @t AS TABLE(ID INT)
EXECUTE @traceSummaryID=LOGGING.INS_trace @traceSummaryID,@proc,'Temporary Table',@printOnScreen
-- Business Logic goes here
-- NOTE: Avoid calling LOGGING.INS_trace in the TRANSACTION!
-- 1. reduce LOCK on trace tables
-- 2. when rolled-back, everything is gone
-- Keep transaction small
BEGIN TRANSACTION
END_OF_SP:
COMMIT
-- Logging
EXECUTE @traceSummaryID=LOGGING.INS_trace @traceSummaryID,@proc,'End',@printOnScreen
END TRY
BEGIN CATCH
IF XACT_STATE()= 1 COMMIT
IF XACT_STATE()=-1 ROLLBACK
-- execute everything else after COMMIT/ROLLBACK
EXECUTE ERROR.INS_error @traceSummaryID,@printOnScreen --establish a link between logging and error handling
RETURN @returnCode -- refer CONFIG.procReturnCode
END CATCH
GO
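Not part of the framework itself, but here is a hedged usage sketch showing how the log data can be analyzed afterwards (requirement 5): run the template procedure once, derive per-step durations by joining each trace row to the next step of the same run, and pull any errors through traceSummaryID.
DECLARE @out INT
EXECUTE dbo.template_Transaction @iParam=1, @oParam=@out OUTPUT
-- Duration of each logged step = gap between a step and the next step of the same run
SELECT s.procName, d1.step, d1.stepName,
       DATEDIFF(MILLISECOND, d1.traceDateTime, d2.traceDateTime) AS stepMs
FROM LOGGING.traceSummary s
JOIN LOGGING.traceDetail d1 ON d1.traceSummaryID=s.traceSummaryID
LEFT JOIN LOGGING.traceDetail d2 ON d2.traceSummaryID=d1.traceSummaryID AND d2.step=d1.step+1
ORDER BY s.traceSummaryID, d1.step
-- Errors captured during traced runs, linked back to the run that raised them
SELECT s.procName, e.ErrorDateTime, e.ErrorNumber, e.ErrorMessage
FROM ERROR.error e
JOIN LOGGING.traceSummary s ON s.traceSummaryID=e.traceSummaryID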
In most RDBMSs, DBAs and DB developers keep weighing whether to use a Unique Identifier (GUID) or an Identity (auto-increment, sequence, etc.) for a key column.
To be frank, many experienced DBAs settled on the answer a long time ago, and some of them have already written about this topic elsewhere on the Internet.
So, what do experts recommend? IDENTITY, if possible.
Why?
1. IDENTITY column is always unique and sorted in a sequence that we want (ascending or descending)
2. With these unique characteristics, it saves a lot of resources when INSERTING data into the table
i. new data always goes to the bottom of the BTree/Heap
ii. the database engine does not need to adjust the BTree when new data comes in
iii. it is much easier to partition the data (well, it depends on your data partitioning strategy)
iv. if this column is a PK or clustered index, it will impact the storage requirement of your non-clustered indexes. (For MSSQL, Unique Identifier is 16 bytes while BIGINT is 8 bytes) As you know, each non-clustered index will contain your clustered index key.
3. You hardly ever UPDATE an IDENTITY column (and it is not good practice to update one anyway)
However, there are some challenges that we might face when using IDENTITY
1. Limit of BIGINT is 9,223,372,036,854,775,807
- in BIG DATA era, it becomes possible to hit the limit
2. Data migration between databases is not straightforward
- we have to put in extra effort to ensure 100% data integrity
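To make the storage point concrete, here is a small illustration (the table names are hypothetical, not from the post): the same table keyed with a BIGINT IDENTITY versus a random UNIQUEIDENTIFIER. The 8-byte IDENTITY key is half the size of the 16-byte GUID, and since every non-clustered index repeats the clustered key, the difference multiplies across indexes.
-- IDENTITY-keyed version: new rows always append at the end of the clustered index
CREATE TABLE orders_identity (
  orderID  BIGINT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  orderRef VARCHAR(50) NOT NULL
)
-- GUID-keyed version: NEWID() is random, so inserts land on random pages and cause page splits
CREATE TABLE orders_guid (
  orderID  UNIQUEIDENTIFIER NOT NULL CONSTRAINT DF_orders_guid_orderID DEFAULT NEWID(),
  orderRef VARCHAR(50) NOT NULL,
  CONSTRAINT PK_orders_guid PRIMARY KEY CLUSTERED (orderID)
)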
Oracle 10g
(refer http://docs.oracle.com/cd/A57673_01/DOC/server/doc/SCN73/ch23.htm)
A. Full Backup
i) NOARCHIVELOG mode - best for one-time database restore e.g. create testing/development database
ii) ARCHIVELOG mode - best for database recovery
B. Partial Backup
i) backup all datafiles under ONE tablespace
ii) backup SINGLE datafile
iii) backup a control file
> useful under ARCHIVELOG mode so that, upon restore, changes can be applied from the archived redo logs.
C. Incremental Backup
i) differential incremental backup - changes in data blocks since most recent backup at the same level or lower
ii) cumulative incremental backup - changes in data blocks since most recent backup of the next lowest level
MSSQL 2005/2008/2012
Concepts of Recovery Model
i) Simple - no support for log backup
ii) Bulk-Logged - minimal logging for bulk operations. DMLs are still logged. Bulk operations performed after the last backup cannot be recovered if the data files are lost.
iii) Full - support point-in-time recovery
(refer http://msdn.microsoft.com/en-us/library/ms189275.aspx)
A. Full Backup - a data backup (all online data files, plus enough transaction log to recover them) taken at a certain point in time.
- if there is any offline data file, the full backup operation will fail; the administrator has to perform a partial backup instead.
(refer http://msdn.microsoft.com/en-us/library/ms175477.aspx)
B. Log Backup - backs up the transaction log records written since the previous log backup (or since the first full backup, for the first log backup); only available when the recovery model is set to Full or Bulk-Logged
(refer http://msdn.microsoft.com/en-us/library/ms191429.aspx)
C. File Backup - backs up all the data in one or more online files or filegroups (a filegroup that contains one or more offline data files is considered offline and therefore cannot be backed up)
(refer http://msdn.microsoft.com/en-us/library/ms189860.aspx)
D. Partial Backup - useful when we want to exclude read-only filegroups
(refer http://msdn.microsoft.com/en-us/library/ms191539.aspx)
E. Differential Backup - changes since latest full/partial backup (a.k.a. cumulative incremental backup)
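For reference, here is a minimal T-SQL sketch of the five backup types above; the database name MyDb and the file paths are placeholders.
BACKUP DATABASE MyDb TO DISK=N'D:\backup\MyDb_full.bak'                          -- A. Full
BACKUP LOG MyDb TO DISK=N'D:\backup\MyDb_log.trn'                                -- B. Log (Full/Bulk-Logged only)
BACKUP DATABASE MyDb FILEGROUP=N'PRIMARY' TO DISK=N'D:\backup\MyDb_fg.bak'       -- C. File/filegroup
BACKUP DATABASE MyDb READ_WRITE_FILEGROUPS TO DISK=N'D:\backup\MyDb_part.bak'    -- D. Partial
BACKUP DATABASE MyDb TO DISK=N'D:\backup\MyDb_diff.bak' WITH DIFFERENTIAL        -- E. Differential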
MySQL 5.0/5.1/5.5
(refer http://dev.mysql.com/doc/refman/5.0/en/backup-types.html)
A. Full Backup
- complete database backup
- Command: mysqldump (for MyISAM/Archive table, can use mysqlhotcopy)
- does not include INFORMATION_SCHEMA
- does not include PERFORMANCE_SCHEMA
- does not include ndbinfo
- does not include log/configuration file
(refer http://dev.mysql.com/doc/refman/5.5/en/mysqldump.html and http://dev.mysql.com/doc/refman/5.5/en/mysqlhotcopy.html)
B. Incremental Backup
- MUST enable --log-bin at startup (or set log_bin and expire_logs_days in my.cnf)
- flush bin log to file for database recovery later (similar to log backup)
C. Snapshot Backup - a file-system capability, provided by third-party tools (Veritas, LVM, ZFS)
Note: physical backup vs logical backup, online backup vs offline backup, import/export data and compressed backup are not covered in this article.
The next post will discuss how to use database backups to perform recovery.
Possible Root Cause:
SWFUpload opens a new PHP session when it connects to Apache to upload file(s) to the web server's file system.
Solution:
1. Remember to pass in postParams
$this->widget('application.extensions.swfupload.CSwfUpload', array(
'jsHandlerUrl'=>'js/swfupload/handlers.js',
'postParams'=>array('PHPSESSID'=>session_id()),
'config'=>array( ...
2. In the Controller, modify accessRules() to allow any user to call the action that is going to be used by SWFUpload.
Below is the sample code from my View file:
$uploadUrl=$this->createUrl('photo/swfupload');
$this->widget('application.extensions.swfupload.CSwfUpload', array(
'jsHandlerUrl'=>'js/swfupload/handlers.js',
'postParams'=>array('PHPSESSID'=>session_id()),
'config'=>array(
'debug'=>false,
'use_query_string'=>true,
'upload_url'=>$uploadUrl,
'file_size_limit'=>'5 MB',
'file_types'=>'*.jpg;*.png;*.gif',
'file_types_description'=>'Image Files',
'file_upload_limit'=>0,
'file_queue_error_handler'=>'js:fileQueueError',
'file_dialog_complete_handler'=>'js:fileDialogComplete',
'upload_progress_handler'=>'js:uploadProgress',
'upload_error_handler'=>'js:uploadError',
'upload_success_handler'=>'js:uploadSuccess',
'upload_complete_handler'=>'js:uploadComplete',
'custom_settings'=>array('progressTarget'=>'divFileProgressContainer','cancelButtonId'=>'btnCancel'),
'button_placeholder_id'=>'swfupload',
'button_image_url'=>'js/swfupload/images/TestImageNoText_100x20.png',
'button_width'=>50,
'button_height'=>20,
'button_text'=>' Select File(s) ',
'button_text_style'=>'.button { font-family: "Lucida Sans Unicode", "Lucida Grande", sans-serif; font-size: 9pt; text-align: center; }',
'button_text_top_padding'=>0,
'button_text_left_padding'=>0,
'button_window_mode'=>'js:SWFUpload.WINDOW_MODE.TRANSPARENT',
'button_cursor'=>'js:SWFUpload.CURSOR.HAND',
),
)
);
?>
As you can see, photo/swfupload is my controller/action pair.
So, in the photoController file, we need to change the accessRules() function to allow anyone to call actionSwfupload(). Below is my sample code:
public function accessRules()
{
return array(
array('allow', // allow all users to perform 'index' and 'view' actions
'actions'=>array('index','view','swfupload'),
'users'=>array('*'),
),
);
}
Please feel free to share your experience on this issue.
Yii allows developers to construct SQL statements directly.
This helps an advanced developer tune his/her SQL queries (for example, by making sure appropriate indexes are used).
Example:
$connection=Yii::app()->db;
$sql = "SELECT t1.colA, t2.colB, t3.colC, t3.colD
FROM tableA t1
LEFT JOIN tableB t2 ON t1.id=t2.id
LEFT JOIN tableC t3 ON t2.pid=t3.pid
WHERE t1.id BETWEEN 10 AND 100
ORDER BY t1.modifyDate DESC";
$command = $connection->createCommand($sql);
$dataReader = $command->query();
$result = $dataReader->readAll();
Internal:
$command = $connection->createCommand($sql)
Note: $command is a CDbCommand
|- $dataReader = $command->query()
Note: $dataReader is a CDbDataReader
|- $dataReader->readAll() returns an array
Definition:
public array findAllBySql(string $sql, array $params=array ( ))
(refer http://www.yiiframework.com/doc/api/1.1/CActiveRecord#findAllBySql-detail)
Explanation:
Finds all active records using the specified SQL statement.
Example:
1. $model = MyModel::model()->findAllBySql('SELECT colA, colB FROM any_table');
2. $model = MyModel::model()->findAllBySql($sqlString);
3. $model = MyModel::model()->findAllBySql('SELECT * FROM any_table WHERE colA=:param', array(':param'=>$value));
Internal:
findAllBySql($sql, $params)
|-$command = createSqlCommand($sql,$params)
Note: $command is a CDbCommand
|-populateRecords($command->queryAll())