Wednesday, June 12, 2019

Initial Data Analysis (IDA)

Reference
Introduction
  • IDA aims to inspect and inject data prior to main data analysis stage.
  • Stages:
    1. Data quality check
    2. Data transformation
    3. Data randomization
    4. Data characteristics documentation

Data Quality Check
  • Assessment types:
    1. Frequency counts
    2. Descriptive statistics (mean, standard deviation, median)
    3. Normality (skewness, kurtosis, frequency histograms)
  • Types of Data issue:
    1. Duplicate record
    2. Inconsistent date and time stamps
    3. Outliers
    4. Missing values

Data Transformation
  • Assessment types:
    1. Square root transformation
    2. Log-transformation
    3. Inverse transformation
    4. Make categorical

Data Randomization
  • Randomize the data and prove that sample data agree with the original intentions
  • Methods:
    1. Generate random permutation of the data
    2. Select random sample of the data

Data Characteristics Documentation
  • Changes (modified/removed/manipulated) to the original data
  • Shape of the distribution of variables
  • Error rates/patterns
  • Criteria to detect abnormality

Wednesday, May 22, 2019

Python Dictionary

Links: Journey to Data Scientist


Characteristics
  • Unordered (followed the sequence the keys were added)
  • Mutable
  • Can be nested
  • Access via key

#create an empty dictionary
d = {}
print(type(d))

Output: <class 'dict'>


d = {'key1':'value1',2:'value2',0.01:201}
print(d)
print(d['key1'])
print(d[2])
print(d[0.01])

Output:
{'key1': 'value1', 2: 'value2', 0.01: 201}
value1
value2
201


d[2] = 'update value'
print(d[2])

Output: update value


for k, v in d.items():
  print(k, ":", v)

Output:
key1 : value1
2 : update value
0.01 : 201


if 'key1' in d:
  print(d['key1'])

Output: value1


d['add'] = 1990
print(d)

Output: {'key1': 'value1', 2: 'update value', 0.01: 201, 'add': 1990}


d.pop('add')
print(d)

Output: {'key1': 'value1', 2: 'update value', 0.01: 201}


print(d)
d.clear()
print(d)

Output:
{'key1': 'value1', 2: 'value2', 0.01: 201}
{}


Tuesday, May 21, 2019

Python Sets

Links: Journey to Data Scientist


Characteristics
  • Unordered
  • Hashable
  • Unique
  • Immutable
  • Support mathematical operations like union, intersection, difference, and symmetric difference

1. Unordered (in fact, output is ordered by hashed value)
Example #1
s = {"abc",123,9.99,"DEF",3.142,100}
print(s)
print(type(s))

Output:
{3.142, 100, 9.99, 'abc', 'DDf', 123}
<class 'set'>

Example #2
s = s = {"abc","123","9.99","DDf","3.142","100"}
print(s)
print(type(s))

Output:
{'100', 'abc', 'DDf', '3.142', '9.99', '123'}
<class 'set'>

Example #1 and #2 above show different ordering of output compare to their original definition. This is because Python will hash the elements first. Different data type will use different algorithm.


2. Common Operations
s = set() #empty set
print(s)
s = {"abc",123,9.99,"DDf",3.142,100} #100 will be the first one to be pop-ed
print(s)
print("Length of s: ",len(s)) #number of element in set
s.add("one item") #add one element
print("Added 'one item':",s)
s.update(("i1","i2","i3")) #add multiple items
print("Updated 'i1, i2, i3':",s)
s.remove(3.142)
print("Removed 3.142: ",s)
s.discard(99999) #this will not trigger error
print("Discarded 99999 (no effect): ",s)
s.discard("DDf")
print("Discarded DDf: ",s)
s.pop()
print("Poped last item (in original definition): ",s)
s.clear() #make it an empty set
print(s)

s.remove(99999) #this will trigger error because 99999 does not exist in the set
del s #this will release 's' from memory, this variable will become undefined

Output:
set()
{3.142, 100, 9.99, 'abc', 'DDf', 123}
Length of s: 6
Added 'one item': {3.142, 100, 'one item', 9.99, 'abc', 'DDf', 123}
Updated 'i1, i2, i3': {3.142, 100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Removed 3.142: {100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Discarded 99999 (no effect): {100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Discarded DDf: {100, 'one item', 'i2', 9.99, 'abc', 'i1', 123, 'i3'}
Poped last item (in original definition): {'one item', 'i2', 9.99, 'abc', 'i1', 123, 'i3'}
set()


2. Mathematical Operations
s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

print("Difference: ",s1.difference(s2))
print("Intersection: ",s1.intersection(s2))
print("IsDisjoint: ",s1.isdisjoint(s2))
print("IsSubSet: ",s1.issubset(s2))
print("IsSuperSet: ",s1.issuperset(s2))
print("Symmetric Difference: ",s1.symmetric_difference(s2))
print("Union: ",s1.union(s2))

Output:
Difference: {1, 3, 4}
Intersection: {8, 9}
IsDisjoint: False
IsSubSet: False
IsSuperSet: False
Symmetric Difference: {1, 3, 4, 5, 6, 7}
Union: {1, 3, 4, 5, 6, 7, 8, 9}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.difference_update(s2)
print("After Difference Update: ",s1)

Output: After Difference Update: {1, 3, 4}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.intersection_update(s2)
print("After Intersection Update: ",s1)

Output: After Intersection Update: {8, 9}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.symmetric_difference_update(s2)
print("After Symmetric Difference Update: ",s1)

Output: After Symmetric Difference Update: {1, 3, 4, 5, 6, 7}

Wednesday, May 15, 2019

Python Tuples

Links: Journey to Data Scientist


Characteristics
  • Ordered
  • Can contain arbitrary objects
  • Accesible via index
  • Can be nested
  • Immutable

It shares most of the characteristics as Lists except it is immutable.
Therefore, operations on Tuples are faster than Lists (when number of elements is getting larger)

Immutable
t = ("apple","moon","3",57,9.99,0x3e)
print(t)
print(t[1])
print(t[3:6])
print(t[::4])

Output:
('apple', 'moon', '3', 57, 9.99, 62)
moon
(57, 9.99, 62)
('apple', 9.99)

If we try to modify the value:
t[0] = "orange"
Output:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-9456d7af7fd3> in <module>
----> 1 t[0] = "orange"

TypeError: 'tuple' object does not support item assignment

Unpacking Tuples
t = ("apple","moon","3",57,9.99,0x3e)
(u1,u2,u3,u4,u5,u6) = t
print(u1,u2,u3,u4,u5,u6)

Output: apple moon 3 57 9.99 62




Tuesday, May 14, 2019

Journey to Data Scientist

Python

  • Basics
    • List | Tuple| Set| Dictionary
    • Lambda
    • Array
    • Object Oriented Programming (OOP)
    • Iterator
    • JSon
    • Regular Expression
    • File Operations
    • Database Operations
  • Numpy
    • Array
    • Universal Functions
    • Broadcasting
  • Pandas
    • Series
    • DataFrame
    • Pivot Table
  • Visualization
    • matplotlib.pyplot
    • seaborn
    • bokeh

Statistics

  • Initial Data Analysis (IDA)
  • Exploratory Data Analysis (EDA)
  • Confirmatory Data Analysis (CDA)
  • Probabilistic Logic and Statistical Inference
  • Parameter estimation by optimization
  • Dimension Reduction

Machine Learning

  • Supervised
  • Unsupervised
  • Reinforced

References

Python Lists

Links: Journey to Data Scientist


Characteristics
  • Ordered
  • Can contain arbitrary objects
  • Access via index
  • Can be nested
  • Mutable

1. Ordered
l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l)

Output: [21, 5, 3, 18, 34, 67, 9, 12, 88, 73, 65, 13]

2. Can contain arbitrary objects
l = [21,0x1e2f,'this is a string',3.142,8.176J,len('99999')]
print(l)

Output: [21, 7727, 'this is a string', 3.142, 8.176j, 5]

3. Access via index
l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l[0],l[-1])

Output: 21 13

4. Can be nested
l = [21,5,[3,18,34,[67,9,12],88,73],65,13]
print(l)
print(l[2][2])
print(l[2][3][2])

Output:
[21, 5, [3, 18, 34, [67, 9, 12], 88, 73], 65, 13]
34
12

5. Mutable
l = [21,5,3]
print(l)
del l[0]
print("Deleted l[0]:",l[0])
print(l)
l[1] = 99
print("Updated l[1] to 99:",l)
l += ["new"]
print("Added 'new':",l)
l += "hello"
print(l)
l.append("append")
print("Appended 'append':",l)
print(l + ["another","list"])
l.extend("extend")
print("Extended 'extend':",l)
l.insert(2,"insert")
print("Inserted 'insert':",l)
l.remove("l")
print("Removed first occurrence of 'l':",l)
l.pop()
print("Poped last item:",l)
l.pop(-4)
print("Poped last 4th item:",l)
print("Location of first 'e':",l.index("e"))
print("Number of 'e':",l.count("e"))
l.clear()
print("Empty the list:",l)

Output:
[21, 5, 3]
Deleted l[0]: 5
[5, 3]
Updated l[1] to 99: [5, 99]
Added 'new': [5, 99, 'new']
[5, 99, 'new', 'h', 'e', 'l', 'l', 'o']
Appended 'append': [5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append']
[5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'another', 'list']
Extended 'extend': [5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Inserted 'insert': [5, 99, 'insert', 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Removed first occurrence of 'l': [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Poped last item: [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n']
Poped last 4th item: [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 't', 'e', 'n']
Location of first 'e': 5
Number of 'e': 3
Empty the list: []

Slicing
l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l[3:7])

Output: [18, 34, 67, 9]

List Comprehension
l = [x for x in range(10)]
print(l)

Output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Note:
List comprehension comes with cost.
  • %timeit l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    • 63.7 ns ± 1.29 ns per loop
  • %timeit l = [x for x in range(10)]
    • 536 ns ± 11.1 ns per loop
  • %timeit l = list(map(lambda x: x, range(10)))
    • 959 ns ± 12.1 ns per loop

Saturday, July 20, 2013

Database - Data Warehouse - Staging Table Design

Staging Table

In data warehousing or reporting-driven OTLP system, it is very normal that people use staging table to store temporary data before they transfer/cleanse/transform the data into a permanent table.
One of the reasons that motivates us to use staging table is to make sure a clean copy of data is in its latest state and has been processed successfully before pushing it to final table. In another word, instead of locking live table for as long as the transformation process will take, we make all the heavy processing behind the scene. In the end of the process, we will just swap the table. This will reduce user interruption from hours to just milli- or microsecond.

The Design

Imagine we have a flat file with 10 columns. We will create a staging table with the exact same 10 columns with corresponding data type. And, we will leave all columns as nullable.

Why?
In each RDBMS, there is always a bulk loading utility to insert data into the table. And, always, this utility will perform much faster compare to any other DML command. For example, bcp.exe is the bulk loading utility in MS SQL Server. To import 1 million records into a table, bcp.exe took me 9.4 seconds on my machine. BULK INSERT took me 13 seconds. INSERT statement is the slowest - 1 minute. Also, bcp.exe has the minimum IO impact on harddisk as well.

Some people may comment Java/C#/C++ can do better job. However, I am not an expert in that area. So, I have no comment on that. :)

Next, when we have imported data into our staging table, we can create data validation script to identify bad data by batch. Well, in database world, everything is set-based. If we were to identity bad data row-by-row, might as well we do everything in Java/C#/C++. So, now, we will log all data violation information based on rules that user has provided.

After we have removing bad data (or unqualified records), we may perform process below:
1. create ID for new natural keys
2. map records to existing ID based on natural key
3. identify relationship between tables (find foreign key ID)
4. use SCD1, SCD2 or SCD3 method to insert data into final table(s)

Sometimes, due to complexity of the source data, we may need to create second/third level staging. This could be very helpful to perform step-by-step verification and increase data quality in the final table.

I hope you enjoy this article. Please feel free to comment. :)

Tuesday, March 5, 2013

Database - SQL Server - Convert Delimited String into Rows

Scenario:
Developer comes to us with a problem statement: How do we convert a delimited string variable into rows using T-SQL?
For example:
DECLARE @str VARCHAR(200)='aaa,bbb,ccc,ddd'

Expected output:
  Result  
aaa
bbb
ccc
ddd


Okay, I manage to find 3 methods on the Internet. :D

Method 1
SELECT SUBSTRING(@str,CHARINDEX(',',','+@str+',',number),CHARINDEX(',',','+@str+',',number+1)-CHARINDEX(',',','+@str+',',number)-1)
FROM master.dbo.spt_values
WHERE name IS NULL AND number <= LEN(@str)
AND number=CHARINDEX(',',','+@str+',',number)

Method 2
DECLARE @xml xml
SET @xml=N''+REPLACE(@str,',','
')+''SELECT x.value('.','NVARCHAR(255)')
FROM @xml.node('//root/r's) AS a(x)

Method 3
SET @str='SELECT '''+REPLACE(@str,',',''' UNION ALL SELECT ''')+''''
EXEC(@str)

Somehow, I find that method 3 is the fastest on my machine.
Some benchmark data:
MethodAvg Speed (ms)
Method 1143
Method 233
Method 33


Do you get same result on your machine?