Leo Leong's Blog

Wednesday, June 12, 2019

Initial Data Analysis (IDA)

Introduction

IDA aims to inspect and prepare the data before the main data analysis stage.
Stages:

Data quality check
Data transformation
Data randomization
Data characteristics documentation

Data Quality Check

Assessment types:

Frequency counts
Descriptive statistics (mean, standard deviation, median)
Normality (skewness, kurtosis, frequency histograms)

Types of data issues:

Duplicate records
Inconsistent date and time stamps
Outliers
Missing values
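
The checks above can be sketched with the standard library alone. The records, the column meaning (id, age) and the 0 to 110 age range below are assumptions made up for illustration, not part of any real dataset.

```python
from collections import Counter
import statistics

# Hypothetical sample records: (id, age); None marks a missing value.
records = [(1, 34), (2, 29), (2, 29), (3, None), (4, 31), (5, 120)]

# Frequency counts per record; a count above 1 means a duplicate record.
counts = Counter(records)
duplicates = [rec for rec, n in counts.items() if n > 1]

# Missing values: rows whose age is None.
missing = [rec for rec in records if rec[1] is None]

# Descriptive statistics on the de-duplicated, non-missing ages.
ages = [age for _, age in counts if age is not None]
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
stdev_age = statistics.stdev(ages)

# Outliers via a simple domain rule (assumed plausible range: 0 to 110).
outliers = [age for age in ages if not 0 <= age <= 110]

print("Duplicates:", duplicates)
print("Missing:", missing)
print("Mean / median:", mean_age, median_age)
print("Outliers:", outliers)
```

The mean (53.5) sits far above the median (32.5), which is already a hint that one value is pulling the distribution to the right.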

Data Transformation

Transformation types:

Square root transformation
Log-transformation
Inverse transformation
Make categorical
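
A minimal sketch of the four transformations, using only the math module. The sample values and the cut points in to_category are arbitrary choices for illustration.

```python
import math

# Hypothetical positive, right-skewed values.
values = [1.0, 4.0, 9.0, 100.0]

sqrt_values = [math.sqrt(v) for v in values]    # requires v >= 0
log_values = [math.log(v) for v in values]      # natural log, requires v > 0
inverse_values = [1.0 / v for v in values]      # requires v != 0


def to_category(v):
    """Bin a number into a label; the cut points are arbitrary."""
    if v < 5:
        return "low"
    if v < 50:
        return "medium"
    return "high"


categories = [to_category(v) for v in values]

print(sqrt_values)
print(categories)
```

Note the preconditions in the comments: a log or inverse transformation will fail on zeros, so those need to be handled during the data quality check first.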

Data Randomization

Randomize the data and check that the sampled data agree with the original intentions.
Methods:

Generate random permutation of the data
Select random sample of the data
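
Both methods are available in the standard random module. This is only a sketch: the data and the seed value 42 are arbitrary, and the seed is fixed just so that the run can be repeated.

```python
import random

data = list(range(1, 11))   # hypothetical data: 1..10
rng = random.Random(42)     # fixed, arbitrary seed for reproducibility

# Method 1: random permutation (shuffle a copy, keep the original intact).
permutation = data[:]
rng.shuffle(permutation)

# Method 2: random sample without replacement.
sample = rng.sample(data, k=3)

print("Permutation:", permutation)
print("Sample:", sample)
```

A permutation keeps every element exactly once, while a sample keeps only k distinct elements.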

Data Characteristics Documentation

Changes (modified/removed/manipulated) to the original data
Shape of the distribution of variables
Error rates/patterns
Criteria to detect abnormality

Wednesday, May 22, 2019

Python Dictionary

Links: Journey to Data Scientist

Characteristics

Keeps insertion order (since Python 3.7, keys follow the sequence in which they were added)
Mutable
Can be nested
Access via key

#create an empty dictionary
d = {}
print(type(d))

Output: <class 'dict'>

d = {'key1':'value1',2:'value2',0.01:201}
print(d)
print(d['key1'])
print(d[2])
print(d[0.01])

Output:
{'key1': 'value1', 2: 'value2', 0.01: 201}
value1
value2
201

d[2] = 'update value'
print(d[2])

Output: update value

for k, v in d.items():
    print(k, ":", v)

Output:
key1 : value1
2 : update value
0.01 : 201

if 'key1' in d:
    print(d['key1'])

Output: value1

d['add'] = 1990
print(d)

Output: {'key1': 'value1', 2: 'update value', 0.01: 201, 'add': 1990}

d.pop('add')
print(d)

Output: {'key1': 'value1', 2: 'update value', 0.01: 201}

print(d)
d.clear()
print(d)

Output:
{'key1': 'value1', 2: 'update value', 0.01: 201}
{}
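
The "can be nested" and "access via key" characteristics are not shown above, so here is a small sketch. The profile dictionary and its contents are made up for illustration.

```python
# Hypothetical nested dictionary, for illustration only.
profile = {"name": "example", "skills": {"python": 3, "sql": 5}}

# Chain the keys to reach a nested value.
sql_level = profile["skills"]["sql"]

# get() returns a default instead of raising KeyError for a missing key.
r_level = profile["skills"].get("r", 0)

# Since Python 3.7, iteration follows insertion order, not sorted order.
key_order = list({"b": 1, "a": 2})

# Keys must be hashable, so a list cannot be used as a key.
try:
    bad = {["a", "list"]: 1}
except TypeError as exc:
    error_message = str(exc)

print(sql_level, r_level, key_order)
print(error_message)
```

Using get() with a default is usually tidier than checking "key in d" first when a missing key is an expected case.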

Tuesday, May 21, 2019

Python Sets

Links: Journey to Data Scientist

Characteristics

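Unordered
Elements must be hashable
Elements are unique
Mutable (the set can change, but each element must be immutable; frozenset is the immutable variant)
Supports mathematical operations like union, intersection, difference, and symmetric difference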

1. Unordered (in fact, the output order follows the hashed values, not the order of definition)

Example #1
s = {"abc",123,9.99,"DDf",3.142,100}
print(s)
print(type(s))

Output:
{3.142, 100, 9.99, 'abc', 'DDf', 123}
<class 'set'>

Example #2
s = {"abc","123","9.99","DDf","3.142","100"}
print(s)
print(type(s))

Output:
{'100', 'abc', 'DDf', '3.142', '9.99', '123'}
<class 'set'>

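Examples #1 and #2 above show that the output order differs from the order in the original definition. This is because a set stores its elements in a hash table, so the iteration order depends on the hash values, and different data types use different hash algorithms. String hashes are also randomized per Python process by default, so the order of string elements can change from one run to the next.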
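
Since the order cannot be relied on, a small sketch of what can be relied on instead: equality ignores order, equal hashes collapse into one element, and sorted() gives a repeatable order. Sorting mixed types by key=str is just one possible choice made for this example.

```python
a = {"abc", 123, 9.99}
b = {9.99, "abc", 123}

# Two sets are equal when they hold the same elements, whatever the order.
same = (a == b)

# 1, 1.0 and True compare equal and share a hash, so only one is kept.
collapsed = len({1, 1.0, True})

# sorted() returns a list with a repeatable order.
ordered = sorted({3, 1, 2})

# Mixed types cannot be compared directly, so sort on their string form.
stable = sorted(a, key=str)

print(same, collapsed, ordered, stable)
```

Whenever a set has to be printed or compared in tests, sorting it first avoids surprises between runs.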

2. Common Operations

s = set() #empty set ({} would create an empty dictionary)
print(s)
s = {"abc",123,9.99,"DDf",3.142,100}
print(s)
print("Length of s: ",len(s)) #number of elements in the set
s.add("one item") #add one element
print("Added 'one item':",s)
s.update(("i1","i2","i3")) #add multiple items
print("Updated 'i1, i2, i3':",s)
s.remove(3.142)
print("Removed 3.142: ",s)
s.discard(99999) #no error even though 99999 is not in the set
print("Discarded 99999 (no effect): ",s)
s.discard("DDf")
print("Discarded DDf: ",s)
s.pop() #removes and returns an arbitrary element (100 in this run)
print("Popped an arbitrary item: ",s)
s.clear() #make it an empty set
print(s)

s.remove(99999) #raises KeyError because 99999 does not exist in the set
del s #unbinds the name 's'; using it afterwards raises NameError

Output:
set()
{3.142, 100, 9.99, 'abc', 'DDf', 123}
Length of s: 6
Added 'one item': {3.142, 100, 'one item', 9.99, 'abc', 'DDf', 123}
Updated 'i1, i2, i3': {3.142, 100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Removed 3.142: {100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Discarded 99999 (no effect): {100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Discarded DDf: {100, 'one item', 'i2', 9.99, 'abc', 'i1', 123, 'i3'}
Popped an arbitrary item: {'one item', 'i2', 9.99, 'abc', 'i1', 123, 'i3'}
set()

3. Mathematical Operations

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

print("Difference: ",s1.difference(s2))
print("Intersection: ",s1.intersection(s2))
print("IsDisjoint: ",s1.isdisjoint(s2))
print("IsSubSet: ",s1.issubset(s2))
print("IsSuperSet: ",s1.issuperset(s2))
print("Symmetric Difference: ",s1.symmetric_difference(s2))
print("Union: ",s1.union(s2))

Output:
Difference: {1, 3, 4}
Intersection: {8, 9}
IsDisjoint: False
IsSubSet: False
IsSuperSet: False
Symmetric Difference: {1, 3, 4, 5, 6, 7}
Union: {1, 3, 4, 5, 6, 7, 8, 9}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.difference_update(s2)
print("After Difference Update: ",s1)

Output: After Difference Update: {1, 3, 4}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.intersection_update(s2)
print("After Intersection Update: ",s1)

Output: After Intersection Update: {8, 9}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.symmetric_difference_update(s2)
print("After Symmetric Difference Update: ",s1)

Output: After Symmetric Difference Update: {1, 3, 4, 5, 6, 7}
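
The same operations are also available as operators. A short sketch using the same s1 and s2, plus frozenset as the immutable variant of a set:

```python
s1 = {1, 3, 4, 8, 9}
s2 = {5, 7, 6, 8, 9}

# Operator forms of the methods shown above.
difference = s1 - s2
intersection = s1 & s2
symmetric = s1 ^ s2
union = s1 | s2

# The methods accept any iterable; the operators require sets on both sides.
union_with_list = s1.union([8, 10])

# A frozenset is immutable and hashable, so it can be a dictionary key.
frozen = frozenset(s1)
lookup = {frozen: "s1"}

print(difference, intersection, symmetric, union)
```

The in-place versions have operator forms as well: -=, &=, ^= and |= behave like difference_update, intersection_update, symmetric_difference_update and update.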

Wednesday, May 15, 2019

Python Tuples

Links: Journey to Data Scientist

Characteristics

Ordered
Can contain arbitrary objects
Accessible via index
Can be nested
Immutable

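A tuple shares most of its characteristics with a list, except that it is immutable.
Because a tuple never changes size, Python can store it more compactly, so creating a tuple is typically somewhat faster than creating a list of the same size (the difference matters more as the number of elements grows).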

Immutable

t = ("apple","moon","3",57,9.99,0x3e)
print(t)
print(t[1])
print(t[3:6])
print(t[::4])

Output:
('apple', 'moon', '3', 57, 9.99, 62)
moon
(57, 9.99, 62)
('apple', 9.99)

If we try to modify the value:
t[0] = "orange"
Output:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-9456d7af7fd3> in <module>
----> 1 t[0] = "orange"

TypeError: 'tuple' object does not support item assignment

Unpacking Tuples

t = ("apple","moon","3",57,9.99,0x3e)
(u1,u2,u3,u4,u5,u6) = t
print(u1,u2,u3,u4,u5,u6)

Output: apple moon 3 57 9.99 62
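
A few more things worth knowing about unpacking and immutability, as a sketch. The grid and nested variables are made up for illustration.

```python
t = ("apple", "moon", "3", 57, 9.99, 0x3e)

# A starred name collects whatever is left over into a list.
first, *middle, last = t

# A tuple of hashable elements is itself hashable, so it can be a dict key.
grid = {(0, 0): "origin", (1, 0): "east"}

# Immutability is shallow: a list inside a tuple can still be changed.
nested = ([1, 2], "x")
nested[0].append(3)

print(first, middle, last)
print(grid[(0, 0)])
print(nested)
```

The number of names on the left must match the number of elements unless one of them is starred, otherwise Python raises ValueError.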

Tuesday, May 14, 2019

Journey to Data Scientist

Python

Basics

List | Tuple | Set | Dictionary
Lambda
Array
Object Oriented Programming (OOP)
Iterator
JSON
Regular Expression
File Operations
Database Operations

NumPy

Array
Universal Functions
Broadcasting

Pandas

Series
DataFrame
Pivot Table

Visualization

matplotlib.pyplot
seaborn
bokeh

Statistics

Initial Data Analysis (IDA)
Exploratory Data Analysis (EDA)
Confirmatory Data Analysis (CDA)
Probabilistic Logic and Statistical Inference
Parameter estimation by optimization
Dimension Reduction

Machine Learning

Supervised
Unsupervised
Reinforcement

References

7 Steps to Mastering Data Preparation with Python (by Matthew Mayo)

Python Lists

Links: Journey to Data Scientist

Characteristics

Ordered
Can contain arbitrary objects
Access via index
Can be nested
Mutable

1. Ordered

l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l)

Output: [21, 5, 3, 18, 34, 67, 9, 12, 88, 73, 65, 13]

2. Can contain arbitrary objects

l = [21,0x1e2f,'this is a string',3.142,8.176J,len('99999')]
print(l)

Output: [21, 7727, 'this is a string', 3.142, 8.176j, 5]

3. Access via index

l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l[0],l[-1])

Output: 21 13

4. Can be nested

l = [21,5,[3,18,34,[67,9,12],88,73],65,13]
print(l)
print(l[2][2])
print(l[2][3][2])

Output:
[21, 5, [3, 18, 34, [67, 9, 12], 88, 73], 65, 13]
34
12

5. Mutable

l = [21,5,3]
print(l)
del l[0]
print("l[0] after deletion:",l[0]) #21 was deleted, so l[0] is now 5
print(l)
l[1] = 99
print("Updated l[1] to 99:",l)
l += ["new"]
print("Added 'new':",l)
l += "hello" #a string is iterable, so each character is added separately
print(l)
l.append("append")
print("Appended 'append':",l)
print(l + ["another","list"]) #returns a new list; l itself is unchanged
l.extend("extend") #same as +=, each character is added separately
print("Extended 'extend':",l)
l.insert(2,"insert")
print("Inserted 'insert':",l)
l.remove("l")
print("Removed first occurrence of 'l':",l)
l.pop()
print("Popped last item:",l)
l.pop(-4)
print("Popped 4th item from the end:",l)
print("Location of first 'e':",l.index("e"))
print("Number of 'e':",l.count("e"))
l.clear()
print("Empty the list:",l)

Output:
[21, 5, 3]
l[0] after deletion: 5
[5, 3]
Updated l[1] to 99: [5, 99]
Added 'new': [5, 99, 'new']
[5, 99, 'new', 'h', 'e', 'l', 'l', 'o']
Appended 'append': [5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append']
[5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'another', 'list']
Extended 'extend': [5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Inserted 'insert': [5, 99, 'insert', 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Removed first occurrence of 'l': [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Popped last item: [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n']
Popped 4th item from the end: [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 't', 'e', 'n']
Location of first 'e': 5
Number of 'e': 3
Empty the list: []

Slicing

l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l[3:7])

Output: [18, 34, 67, 9]
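
A few more slicing forms on the same list, as a quick sketch. A slice is written as l[start:stop:step], and each part can be left out.

```python
l = [21, 5, 3, 18, 34, 67, 9, 12, 88, 73, 65, 13]

first_three = l[:3]      # start defaults to 0
last_three = l[-3:]      # negative index counts from the end
every_third = l[::3]     # step of 3: indexes 0, 3, 6, 9
reversed_copy = l[::-1]  # negative step walks backwards
shallow_copy = l[:]      # new list object with the same elements

print(first_three)
print(last_three)
print(every_third)
print(reversed_copy)
```

Slicing always returns a new list, so changing shallow_copy does not change l.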

List Comprehension

l = [x for x in range(10)]
print(l)

Output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

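Note:
List comprehension comes with a cost: it runs a loop at run time, so it is slower than writing the list literal, and map() with a lambda is slower still. Timings on my machine: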

%timeit l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
- 63.7 ns ± 1.29 ns per loop
%timeit l = [x for x in range(10)]
- 536 ns ± 11.1 ns per loop
%timeit l = list(map(lambda x: x, range(10)))
- 959 ns ± 12.1 ns per loop
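
%timeit only works inside IPython/Jupyter. Outside of it, a similar comparison can be sketched with the standard timeit module. The repeat count of 20,000 is an arbitrary choice, the list(range(10)) case is added here only for comparison, and the numbers will differ from machine to machine.

```python
import timeit

# The three statements timed above, plus list(range(10)) for comparison.
statements = {
    "literal": "l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]",
    "comprehension": "l = [x for x in range(10)]",
    "map + lambda": "l = list(map(lambda x: x, range(10)))",
    "list(range)": "l = list(range(10))",
}

runs = 20_000  # arbitrary; more runs give steadier numbers
timings = {}
for name, stmt in statements.items():
    total_seconds = timeit.timeit(stmt, number=runs)
    timings[name] = total_seconds / runs * 1e9  # nanoseconds per loop

for name, ns in timings.items():
    print(f"{name:>14}: {ns:8.1f} ns per loop")
```

No expected output is shown because the timings depend on the machine and the Python version.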

Saturday, July 20, 2013

Database - Data Warehouse - Staging Table Design

Staging Table

In data warehousing or a reporting-driven OLTP system, it is very common to use a staging table to hold temporary data before transferring, cleansing, and transforming it into a permanent table.
One of the reasons to use a staging table is to make sure a clean copy of the data is in its latest state and has been processed successfully before it is pushed to the final table. In other words, instead of locking the live table for as long as the transformation process takes, we do all the heavy processing behind the scenes. At the end of the process, we just swap the tables. This reduces user interruption from hours to just milliseconds or microseconds.

The Design

Imagine we have a flat file with 10 columns. We create a staging table with exactly the same 10 columns and the corresponding data types, and we leave all columns nullable.

Why?
Every RDBMS comes with a bulk loading utility to insert data into a table, and this utility almost always performs much faster than any other DML command. Leaving the columns nullable lets the bulk load accept every row, so bad data can be caught later by validation instead of failing the load. For example, bcp.exe is the bulk loading utility in MS SQL Server. To import 1 million records into a table, bcp.exe took 9.4 seconds on my machine, BULK INSERT took 13 seconds, and INSERT statements were the slowest at 1 minute. bcp.exe also had the lowest I/O impact on the hard disk.

Some people may comment that Java/C#/C++ can do a better job. However, I am not an expert in that area, so I have no comment on that. :)

Next, once the data has been imported into the staging table, we can write data validation scripts to identify bad data in batches. In the database world, everything is set-based. If we were to identify bad data row by row, we might as well do everything in Java/C#/C++. At this point, we log all data violations based on the rules that the user has provided.

After removing the bad data (or unqualified records), we may perform the steps below:
1. create ID for new natural keys
2. map records to existing ID based on natural key
3. identify relationship between tables (find foreign key ID)
4. use SCD1, SCD2 or SCD3 method to insert data into final table(s)

Sometimes, due to complexity of the source data, we may need to create second/third level staging. This could be very helpful to perform step-by-step verification and increase data quality in the final table.
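
The flow above can be sketched end to end with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (stg_customer, stg_error, dim_customer), the columns, the sample rows and the two validation rules are all made up, SQLite stands in for a real RDBMS, and executemany stands in for a bulk loading utility such as bcp.exe. The final step follows the SCD1 idea (overwrite existing records, insert new ones).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table: same columns as the flat file, all nullable.
    CREATE TABLE stg_customer (natural_key TEXT, name TEXT, age INTEGER);
    CREATE TABLE stg_error (natural_key TEXT, reason TEXT);
    CREATE TABLE dim_customer (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        natural_key TEXT UNIQUE NOT NULL,
        name TEXT NOT NULL,
        age INTEGER
    );
    INSERT INTO dim_customer (natural_key, name, age) VALUES ('C1', 'Old name', 29);
""")

# Load the hypothetical flat-file rows into staging in one batch.
rows = [("C1", "Ann", 30), ("C2", None, 41), (None, "Bob", 25), ("C3", "Cy", 52)]
conn.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)", rows)

# Set-based validation: one statement per rule, no row-by-row loop.
conn.execute("""INSERT INTO stg_error
                SELECT natural_key, 'missing natural key'
                FROM stg_customer WHERE natural_key IS NULL""")
conn.execute("""INSERT INTO stg_error
                SELECT natural_key, 'missing name'
                FROM stg_customer WHERE name IS NULL""")

# Remove the bad data from staging once it has been logged.
conn.execute("DELETE FROM stg_customer WHERE natural_key IS NULL OR name IS NULL")

# SCD1, step 1: overwrite records whose natural key already has an ID.
conn.execute("""UPDATE dim_customer
                SET name = (SELECT s.name FROM stg_customer s
                            WHERE s.natural_key = dim_customer.natural_key),
                    age = (SELECT s.age FROM stg_customer s
                           WHERE s.natural_key = dim_customer.natural_key)
                WHERE natural_key IN (SELECT natural_key FROM stg_customer)""")

# SCD1, step 2: new natural keys get a new ID from the final table.
conn.execute("""INSERT INTO dim_customer (natural_key, name, age)
                SELECT s.natural_key, s.name, s.age FROM stg_customer s
                WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                                  WHERE d.natural_key = s.natural_key)""")
conn.commit()

errors = conn.execute(
    "SELECT natural_key, reason FROM stg_error ORDER BY reason").fetchall()
final = conn.execute(
    "SELECT id, natural_key, name, age FROM dim_customer ORDER BY id").fetchall()
print("Errors:", errors)
print("Final:", final)
```

The two bad rows end up in the error log, the existing C1 record is overwritten, and C3 is inserted with a new ID, while the final table is only touched by the last two statements.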

I hope you enjoy this article. Please feel free to comment. :)

Tuesday, March 5, 2013

Database - SQL Server - Convert Delimited String into Rows

Scenario:
Developer comes to us with a problem statement: How do we convert a delimited string variable into rows using T-SQL?
For example:
DECLARE @str VARCHAR(200)='aaa,bbb,ccc,ddd'

Expected output:

Result
aaa
bbb
ccc
ddd

Okay, I managed to find 3 methods on the Internet. :D

Method 1
SELECT SUBSTRING(@str,CHARINDEX(',',','+@str+',',number),CHARINDEX(',',','+@str+',',number+1)-CHARINDEX(',',','+@str+',',number)-1)
FROM master.dbo.spt_values
WHERE name IS NULL AND number <= LEN(@str)
AND number=CHARINDEX(',',','+@str+',',number)

Method 2
DECLARE @xml xml
SET @xml=N'<root><r>'+REPLACE(@str,',','</r><r>')+'</r></root>'
SELECT x.value('.','NVARCHAR(255)')
FROM @xml.nodes('//root/r') AS a(x)

Method 3
SET @str='SELECT '''+REPLACE(@str,',',''' UNION ALL SELECT ''')+''''
EXEC(@str)

Somehow, method 3 turned out to be the fastest on my machine.
Some benchmark data:

Method	Avg Speed (ms)
Method 1	143
Method 2	33
Method 3	3

Do you get the same result on your machine?
