Recently, I ran into a scenario where my web developer would like to allow his user to download data from a 201-column table into a CSV file. Then, user will modify (add, modify, or remove) records in the CSV and upload the data back to the system again. For each batch, system needs to compare about 100K rows of record between download and upload.
In addition, primary key column will not be downloaded. User can only see 200 columns in the CSV.
Let say, my table looks like this:
CREATE TABLE myTable (
id INT IDENTITY(1,1) PRIMARY KEY,
col1 varchar(100) NULL,
col2 varchar(100) NULL,
col3 varchar(100) NULL,
.
.
.
col200 varchar(100) NULL
)
What is the fastest way to identity changes?
Solutions
I thought of 3 possible methods. Well, you may have more solutions. Please feel free to share with me.
Well, first step is to upload the file via BCP or BULK INSERT into a temporary table called #upload.
Method 1
Use pure FULL OUTER JOIN
SELECT t1.col1, t2.col1,t1.col2,t2.col2,t1.col3,t2.col3,.....,t1.col200,t2.col200
FROM myTable t1 FULL OUTER JOIN #upload t2
ON t1.col1=t2.col1 AND t1.col2=t2.col2 AND t1.col3=t2.col3 ... AND t1.col200=t2.col200
WHERE t1.col1 IS NULL OR t2.col1 IS NULL
Method 2
Use EXCEPT
SELECT col1,col2,col3,...,col200 FROM myTable
EXCEPT
SELECT col1,col2,col3,...,col200 FROM #upload
SELECT col1,col2,col3,...,col200 FROM #upload
EXCEPT
SELECT col1,col2,col3,...,col200 FROM myTable
By combining results from 2 EXCEPT statements above, we are able to get the delta records. Easy and clean.
Method 3
Use additional MD5 computed column + FULL OUTER JOIN
In order to use this method, we need to create a persisted/non-persisted computed column on myTable and #upload. The syntax should look like this (you can change the delimiter if you want):
ALTER TABLE myTable ADD md5 AS HASHBYTES(md5, col1+'||'+col2+'||'+col3+'||'+...+'||'+col200)
ALTER TABLE #upload ADD md5 AS HASHBYTES(md5, col1+'||'+col2+'||'+col3+'||'+...+'||'+col200)
If it is a persisted column, we can create an index on our md5 column. Anyway, our join will become like this:
SELECT t1.md5,t2.md5 FROM myTable t1 FULL OUTER JOIN #upload t2 ON t1.md5=t2.md5
WHERE t1.md5 IS NULL OR t2.md5 IS NULL
Assumptions
1. User is not allowed to delete any column in the CSV file.
2. Number of columns must be greater than "certain" number. I didn't have time to try it out. However, if there are only 3 to 5 columns, it does not worth the effort to use method 3. :)
Result
Based on my own benchmark (your machine may give you different result though), method 3 runs much faster than method 2. Method 1 is the slowest. Also, it is not possible to create an index on those 200 columns.
I have tried method 3 on a 23-column and a 201-column tables. Both works perfectly fast for me.
Please feel free to post your comment if you find any of the method works for you. Cheers! ;)
No comments:
Post a Comment