Wednesday, June 12, 2019

Initial Data Analysis (IDA)

Introduction
  • IDA aims to inspect and, where needed, correct data before the main data analysis stage.
  • Stages:
    1. Data quality check
    2. Data transformation
    3. Data randomization
    4. Data characteristics documentation

Data Quality Check
  • Assessment types:
    1. Frequency counts
    2. Descriptive statistics (mean, standard deviation, median)
    3. Normality (skewness, kurtosis, frequency histograms)
  • Types of data issues:
    1. Duplicate records
    2. Inconsistent date and time stamps
    3. Outliers
    4. Missing values
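
The checks above can be sketched with the standard library alone. This is a minimal illustration, not a full quality-check routine: the records are made up, and the 1.5 * IQR rule for outliers is one common convention assumed here.

```python
from collections import Counter
import statistics

# Hypothetical sample records: (id, age); None marks a missing value.
records = [(1, 34), (2, 29), (2, 29), (3, None), (4, 31), (5, 120), (6, 33), (7, 30)]

# Frequency counts double as a duplicate check: any count above 1 is a duplicate.
counts = Counter(records)
duplicates = [rec for rec, n in counts.items() if n > 1]

# Missing values: records whose age is None.
missing = [rec for rec in records if rec[1] is None]

# Descriptive statistics on the de-duplicated, non-missing ages.
ages = [age for _, age in counts if age is not None]
mean = statistics.mean(ages)
median = statistics.median(ages)
stdev = statistics.stdev(ages)

# Outliers by the 1.5 * IQR rule (an assumed convention, not the only one).
q1, _, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1
outliers = [a for a in ages if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]

print(duplicates, missing, median, outliers)
```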

Data Transformation
  • Transformation types:
    1. Square root transformation
    2. Log-transformation
    3. Inverse transformation
    4. Make categorical
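
A minimal sketch of the four transformations, using only the math module. The values and the cut-off of 10 used for binning are assumptions made for illustration.

```python
import math

# Hypothetical right-skewed, strictly positive values.
values = [1.0, 4.0, 9.0, 100.0]

sqrt_t = [math.sqrt(v) for v in values]    # square root transformation
log_t = [math.log(v) for v in values]      # natural log; requires v > 0
inverse_t = [1.0 / v for v in values]      # inverse transformation; requires v != 0

# "Make categorical": bin each value using an assumed cut-off of 10.
categories = ["low" if v < 10 else "high" for v in values]

print(sqrt_t)
print(categories)
```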

Data Randomization
  • Randomize the data and verify that the sample data agree with the original intentions
  • Methods:
    1. Generate random permutation of the data
    2. Select random sample of the data
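
Both methods are available in the standard random module. The sketch below assumes a made-up data set of 10 records and a fixed seed, chosen only so the run is repeatable.

```python
import random

data = list(range(1, 11))     # hypothetical data set of 10 records
rng = random.Random(42)       # fixed seed, assumed here only for repeatability

# 1. Random permutation: same elements, shuffled order.
permutation = data[:]         # copy so the original list stays intact
rng.shuffle(permutation)

# 2. Random sample without replacement.
sample = rng.sample(data, k=4)

print(permutation)
print(sample)
```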

Data Characteristics Documentation
  • Changes (modified/removed/manipulated) to the original data
  • Shape of the distribution of variables
  • Error rates/patterns
  • Criteria to detect abnormality

Wednesday, May 22, 2019

Python Dictionary

Links: Journey to Data Scientist


Characteristics
  • Insertion-ordered since Python 3.7 (keys keep the sequence in which they were added)
  • Mutable
  • Can be nested
  • Access via key

#create an empty dictionary
d = {}
print(type(d))

Output: <class 'dict'>


d = {'key1':'value1',2:'value2',0.01:201}
print(d)
print(d['key1'])
print(d[2])
print(d[0.01])

Output:
{'key1': 'value1', 2: 'value2', 0.01: 201}
value1
value2
201


d[2] = 'update value'
print(d[2])

Output: update value


for k, v in d.items():
  print(k, ":", v)

Output:
key1 : value1
2 : update value
0.01 : 201


if 'key1' in d:
  print(d['key1'])

Output: value1


d['add'] = 1990
print(d)

Output: {'key1': 'value1', 2: 'update value', 0.01: 201, 'add': 1990}


d.pop('add')
print(d)

Output: {'key1': 'value1', 2: 'update value', 0.01: 201}


print(d)
d.clear()
print(d)

Output:
{'key1': 'value1', 2: 'update value', 0.01: 201}
{}
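
The "can be nested" and "access via key" characteristics can be sketched as follows. The person dictionary and its values are hypothetical.

```python
# Hypothetical nested dictionary.
person = {"name": "Ann", "address": {"city": "Kuala Lumpur", "zip": "50000"}}

city = person["address"]["city"]      # chained keys reach the inner dictionary
phone = person.get("phone", "n/a")    # get() returns a default instead of raising KeyError

try:
    person["phone"]                   # direct access to a missing key raises KeyError
except KeyError as err:
    missing_key = err.args[0]         # the key that was not found

print(city, phone, missing_key)
```

Using get() is a reasonable choice when a missing key is expected; direct indexing is better when a missing key should be treated as an error.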


Tuesday, May 21, 2019

Python Sets

Links: Journey to Data Scientist


Characteristics
  • Unordered
  • Elements must be hashable
  • Elements are unique
  • Mutable (frozenset is the immutable variant)
  • Supports mathematical operations like union, intersection, difference, and symmetric difference

1. Unordered (iteration order depends on the elements' hash values, not on insertion order)
Example #1
s = {"abc",123,9.99,"DDf",3.142,100}
print(s)
print(type(s))

Output:
{3.142, 100, 9.99, 'abc', 'DDf', 123}
<class 'set'>

Example #2
s = {"abc","123","9.99","DDf","3.142","100"}
print(s)
print(type(s))

Output:
{'100', 'abc', 'DDf', '3.142', '9.99', '123'}
<class 'set'>

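Examples #1 and #2 above show an output order that differs from the order in the original definition. This is because Python places set elements according to their hash values, and different data types use different hash algorithms. String hashes are also randomized per interpreter session by default, so the order of strings may differ between runs.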
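
Since the iteration order of a set should not be relied on, a minimal sketch of two related points follows: sorted() gives a defined order when one is needed, and only hashable objects can be set elements.

```python
s = {"abc", "123", "9.99", "DDf", "3.142", "100"}

# sorted() returns a list in a defined order, whatever the set's internal order is.
ordered = sorted(s)
print(ordered)

# Elements must be hashable: a list cannot be a set element, a tuple can.
try:
    bad = {[1, 2]}
    list_allowed = True
except TypeError:
    list_allowed = False

ok = {(1, 2)}
print(list_allowed, ok)
```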


2. Common Operations
s = set() #empty set
print(s)
s = {"abc",123,9.99,"DDf",3.142,100} #pop() removes an arbitrary element, not a specific one
print(s)
print("Length of s: ",len(s)) #number of elements in set
s.add("one item") #add one element
print("Added 'one item':",s)
s.update(("i1","i2","i3")) #add multiple items
print("Updated 'i1, i2, i3':",s)
s.remove(3.142)
print("Removed 3.142: ",s)
s.discard(99999) #this will not trigger error
print("Discarded 99999 (no effect): ",s)
s.discard("DDf")
print("Discarded DDf: ",s)
s.pop() #removes and returns an arbitrary element
print("Popped an arbitrary item: ",s)
s.clear() #make it an empty set
print(s)

s.remove(99999) #this raises KeyError because 99999 does not exist in the set
del s #this unbinds the name 's'; using 's' afterwards raises NameError

Output:
set()
{3.142, 100, 9.99, 'abc', 'DDf', 123}
Length of s: 6
Added 'one item': {3.142, 100, 'one item', 9.99, 'abc', 'DDf', 123}
Updated 'i1, i2, i3': {3.142, 100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Removed 3.142: {100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Discarded 99999 (no effect): {100, 'one item', 'i2', 9.99, 'abc', 'DDf', 'i1', 123, 'i3'}
Discarded DDf: {100, 'one item', 'i2', 9.99, 'abc', 'i1', 123, 'i3'}
Popped an arbitrary item: {'one item', 'i2', 9.99, 'abc', 'i1', 123, 'i3'}
set()


3. Mathematical Operations
s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

print("Difference: ",s1.difference(s2))
print("Intersection: ",s1.intersection(s2))
print("IsDisjoint: ",s1.isdisjoint(s2))
print("IsSubSet: ",s1.issubset(s2))
print("IsSuperSet: ",s1.issuperset(s2))
print("Symmetric Difference: ",s1.symmetric_difference(s2))
print("Union: ",s1.union(s2))

Output:
Difference: {1, 3, 4}
Intersection: {8, 9}
IsDisjoint: False
IsSubSet: False
IsSuperSet: False
Symmetric Difference: {1, 3, 4, 5, 6, 7}
Union: {1, 3, 4, 5, 6, 7, 8, 9}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.difference_update(s2)
print("After Difference Update: ",s1)

Output: After Difference Update: {1, 3, 4}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.intersection_update(s2)
print("After Intersection Update: ",s1)

Output: After Intersection Update: {8, 9}

s1 = {1,3,4,8,9}
s2 = {5,7,6,8,9}

s1.symmetric_difference_update(s2)
print("After Symmetric Difference Update: ",s1)

Output: After Symmetric Difference Update: {1, 3, 4, 5, 6, 7}
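
The same operations are also available as operators. A short sketch on the same two sets; note that the operators require both operands to be sets, while the methods accept any iterable.

```python
s1 = {1, 3, 4, 8, 9}
s2 = {5, 7, 6, 8, 9}

diff = s1 - s2      # same as s1.difference(s2)
inter = s1 & s2     # same as s1.intersection(s2)
sym = s1 ^ s2       # same as s1.symmetric_difference(s2)
union = s1 | s2     # same as s1.union(s2)

# Augmented operators update the left-hand set in place.
s3 = set(s1)        # copy, so s1 stays unchanged
s3 -= s2            # same as s3.difference_update(s2)

print(diff, inter, sym, union, s3)
```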

Wednesday, May 15, 2019

Python Tuples

Links: Journey to Data Scientist


Characteristics
  • Ordered
  • Can contain arbitrary objects
  • Accessible via index
  • Can be nested
  • Immutable

A tuple shares most of the characteristics of a list, except that it is immutable.
Because its size is fixed, a tuple is typically cheaper to create and uses less memory than an equivalent list.
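
One difference between the two that is easy to measure is the memory footprint of the container. The sketch below assumes CPython; the exact byte counts depend on the interpreter version and platform, so none are shown.

```python
import sys

t = ("apple", "moon", "3", 57, 9.99, 0x3e)
l = list(t)

# getsizeof reports the container's own size in bytes, not the size of its elements.
tuple_size = sys.getsizeof(t)
list_size = sys.getsizeof(l)

print("tuple:", tuple_size, "bytes")
print("list: ", list_size, "bytes")
```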

Immutable
t = ("apple","moon","3",57,9.99,0x3e)
print(t)
print(t[1])
print(t[3:6])
print(t[::4])

Output:
('apple', 'moon', '3', 57, 9.99, 62)
moon
(57, 9.99, 62)
('apple', 9.99)

If we try to modify the value:
t[0] = "orange"
Output:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-9456d7af7fd3> in <module>
----> 1 t[0] = "orange"

TypeError: 'tuple' object does not support item assignment

Unpacking Tuples
t = ("apple","moon","3",57,9.99,0x3e)
(u1,u2,u3,u4,u5,u6) = t
print(u1,u2,u3,u4,u5,u6)

Output: apple moon 3 57 9.99 62
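
Unpacking does not need one name per element. A short sketch of starred unpacking and of swapping two variables, which also relies on tuple packing and unpacking:

```python
t = ("apple", "moon", "3", 57, 9.99, 0x3e)

# The starred name collects the remaining elements into a list.
first, *middle, last = t
print(first, middle, last)

# Swapping works by packing (b, a) into a tuple and unpacking it again.
a, b = 1, 2
a, b = b, a
print(a, b)
```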




Tuesday, May 14, 2019

Journey to Data Scientist

Python

  • Basics
    • List | Tuple | Set | Dictionary
    • Lambda
    • Array
    • Object Oriented Programming (OOP)
    • Iterator
    • JSON
    • Regular Expression
    • File Operations
    • Database Operations
  • Numpy
    • Array
    • Universal Functions
    • Broadcasting
  • Pandas
    • Series
    • DataFrame
    • Pivot Table
  • Visualization
    • matplotlib.pyplot
    • seaborn
    • bokeh

Statistics

  • Initial Data Analysis (IDA)
  • Exploratory Data Analysis (EDA)
  • Confirmatory Data Analysis (CDA)
  • Probabilistic Logic and Statistical Inference
  • Parameter estimation by optimization
  • Dimension Reduction

Machine Learning

  • Supervised
  • Unsupervised
  • Reinforcement

Python Lists

Links: Journey to Data Scientist


Characteristics
  • Ordered
  • Can contain arbitrary objects
  • Access via index
  • Can be nested
  • Mutable

1. Ordered
l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l)

Output: [21, 5, 3, 18, 34, 67, 9, 12, 88, 73, 65, 13]

2. Can contain arbitrary objects
l = [21,0x1e2f,'this is a string',3.142,8.176J,len('99999')]
print(l)

Output: [21, 7727, 'this is a string', 3.142, 8.176j, 5]

3. Access via index
l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l[0],l[-1])

Output: 21 13

4. Can be nested
l = [21,5,[3,18,34,[67,9,12],88,73],65,13]
print(l)
print(l[2][2])
print(l[2][3][2])

Output:
[21, 5, [3, 18, 34, [67, 9, 12], 88, 73], 65, 13]
34
12

5. Mutable
l = [21,5,3]
print(l)
del l[0]
print("Deleted l[0], new l[0]:",l[0])
print(l)
l[1] = 99
print("Updated l[1] to 99:",l)
l += ["new"]
print("Added 'new':",l)
l += "hello" #a string is iterated, so each character is added separately
print(l)
l.append("append") #append adds the string as one item
print("Appended 'append':",l)
print(l + ["another","list"]) #creates a new list, l is unchanged
l.extend("extend") #extend iterates the string, like +=
print("Extended 'extend':",l)
l.insert(2,"insert")
print("Inserted 'insert':",l)
l.remove("l")
print("Removed first occurrence of 'l':",l)
l.pop()
print("Popped last item:",l)
l.pop(-4)
print("Popped 4th item from the end:",l)
print("Location of first 'e':",l.index("e"))
print("Number of 'e':",l.count("e"))
l.clear()
print("Empty the list:",l)

Output:
[21, 5, 3]
Deleted l[0], new l[0]: 5
[5, 3]
Updated l[1] to 99: [5, 99]
Added 'new': [5, 99, 'new']
[5, 99, 'new', 'h', 'e', 'l', 'l', 'o']
Appended 'append': [5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append']
[5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'another', 'list']
Extended 'extend': [5, 99, 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Inserted 'insert': [5, 99, 'insert', 'new', 'h', 'e', 'l', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Removed first occurrence of 'l': [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n', 'd']
Popped last item: [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 'x', 't', 'e', 'n']
Popped 4th item from the end: [5, 99, 'insert', 'new', 'h', 'e', 'l', 'o', 'append', 'e', 't', 'e', 'n']
Location of first 'e': 5
Number of 'e': 3
Empty the list: []

Slicing
l = [21,5,3,18,34,67,9,12,88,73,65,13]
print(l[3:7])

Output: [18, 34, 67, 9]
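
A slice takes the form l[start:stop:step], where any part may be omitted and negative values count from the end. A few more variations on the same list:

```python
l = [21, 5, 3, 18, 34, 67, 9, 12, 88, 73, 65, 13]

head = l[:3]             # first three items
tail = l[-3:]            # last three items
every_third = l[::3]     # items at index 0, 3, 6, 9
reversed_copy = l[::-1]  # a reversed copy; l itself is unchanged

print(head)          # [21, 5, 3]
print(tail)          # [73, 65, 13]
print(every_third)   # [21, 18, 9, 73]
print(reversed_copy[0], reversed_copy[-1])   # 13 21
```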

List Comprehension
l = [x for x in range(10)]
print(l)

Output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Note:
A list comprehension costs more than a list literal because it runs a loop at run time. Sample %timeit results:
  • %timeit l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    • 63.7 ns ± 1.29 ns per loop
  • %timeit l = [x for x in range(10)]
    • 536 ns ± 11.1 ns per loop
  • %timeit l = list(map(lambda x: x, range(10)))
    • 959 ns ± 12.1 ns per loop
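
Outside IPython, a similar measurement can be sketched with the standard timeit module. The run count of 100,000 is an assumption, and the figures will differ by machine and Python version, so no expected output is shown.

```python
import timeit

# Each statement is executed `runs` times; timeit returns the total time in seconds.
runs = 100_000
t_literal = timeit.timeit("l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]", number=runs)
t_comprehension = timeit.timeit("l = [x for x in range(10)]", number=runs)
t_map = timeit.timeit("l = list(map(lambda x: x, range(10)))", number=runs)

results = [("literal", t_literal), ("comprehension", t_comprehension), ("map", t_map)]
for name, total in results:
    # Convert the total to nanoseconds per loop for comparison with %timeit.
    print(f"{name}: {total / runs * 1e9:.1f} ns per loop")
```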