[Pandas]Handle missing data

Missing data is a common problem in real data preprocessing, luckily pandas has done a lot to help us handle it. This article will show the codes on how to do it.

###Produce some data with missing values.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                 columns=['one', 'two', 'three'])
df['four']='bar'
df['five']= df['one'] > 0
df['date']= pd.Timestamp('2018-01-01')
df2 = df.reindex(['a','b', 'c','d', 'e', 'f','g', 'h'])
df2

	one	two	three	four	five	date
a	-0.614301	0.628725	-0.903163	bar	False	2018-01-01
b	NaN	NaN	NaN	NaN	NaN	NaT
c	-0.297019	1.357266	-0.665749	bar	False	2018-01-01
d	NaN	NaN	NaN	NaN	NaN	NaT
e	0.311524	-0.328388	0.777467	bar	True	2018-01-01
f	0.572373	-0.309563	-0.296276	bar	True	2018-01-01
g	NaN	NaN	NaN	NaN	NaN	NaT
h	-1.303842	1.239911	-0.909255	bar	False	2018-01-01

Notice that NaN is the default marker of missing number, text or boolean, NaT is for missing datetime

Detect missing value

pandas provides two methods, isna() and notna()

df2.isna()

	one	two	three	four	five	date
a	False	False	False	False	False	False
b	True	True	True	True	True	True
c	False	False	False	False	False	False
d	True	True	True	True	True	True
e	False	False	False	False	False	False
f	False	False	False	False	False	False
g	True	True	True	True	True	True
h	False	False	False	False	False	False

df2.notna()

	one	two	three	four	five	date
a	True	True	True	True	True	True
b	False	False	False	False	False	False
c	True	True	True	True	True	True
d	False	False	False	False	False	False
e	True	True	True	True	True	True
f	True	True	True	True	True	True
g	False	False	False	False	False	False
h	True	True	True	True	True	True

Cleaning

drop na value, dropna() has a parameter axis which indicates either colums(axis = 1) or rows(axis = 0) will be dropped, the default value of axis is 0

df2.dropna()

	one	two	three	four	five	date
a	-0.614301	0.628725	-0.903163	bar	False	2018-01-01
c	-0.297019	1.357266	-0.665749	bar	False	2018-01-01
e	0.311524	-0.328388	0.777467	bar	True	2018-01-01
f	0.572373	-0.309563	-0.296276	bar	True	2018-01-01
h	-1.303842	1.239911	-0.909255	bar	False	2018-01-01

Drop columns with missing data

df2.dropna(axis = 1)


a
b
c
d
e
f
g
h

All columns are removed..

Fill na

Na value can be easily filled by fillna()

df2.fillna(0)

	one	two	three	four	five	date
a	-0.614301	0.628725	-0.903163	bar	False	2018-01-01 00:00:00
b	0.000000	0.000000	0.000000	0	0	0
c	-0.297019	1.357266	-0.665749	bar	False	2018-01-01 00:00:00
d	0.000000	0.000000	0.000000	0	0	0
e	0.311524	-0.328388	0.777467	bar	True	2018-01-01 00:00:00
f	0.572373	-0.309563	-0.296276	bar	True	2018-01-01 00:00:00
g	0.000000	0.000000	0.000000	0	0	0
h	-1.303842	1.239911	-0.909255	bar	False	2018-01-01 00:00:00

We can also fill data with respect to their column property

df2.fillna(value = {'one': 0, 'four': 'missing', 'five': False, 'date': pd.Timestamp('2000-01-01')})

	one	two	three	four	five	date
a	-0.614301	0.628725	-0.903163	bar	False	2018-01-01
b	0.000000	NaN	NaN	missing	False	2000-01-01
c	-0.297019	1.357266	-0.665749	bar	False	2018-01-01
d	0.000000	NaN	NaN	missing	False	2000-01-01
e	0.311524	-0.328388	0.777467	bar	True	2018-01-01
f	0.572373	-0.309563	-0.296276	bar	True	2018-01-01
g	0.000000	NaN	NaN	missing	False	2000-01-01
h	-1.303842	1.239911	-0.909255	bar	False	2018-01-01

Remind that fillna() will not replace na in orignial dataframe, in other words, the df2 remains the same after all the above operation, If you want it to be permanantely changed, add parameter inplace = True like this.

df2.fillna(value = 0, inplace = True)