Missing
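Python can represent missing values with several markers: None, np.nan, and pd.NA. pandas treats None and np.nan alike as missing when it builds numeric data. Because np.nan is a float, putting it into a column of integers changes the column's dtype to float. NumPy uses NaN as its missing marker. pandas additionally provides pd.NA, a missing-value marker used by its nullable extension dtypes (such as Int64, boolean, and string), which lets a column hold missing values without being converted to float.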
import numpy as np
import pandas as pd
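# 'income' holds np.nan, so it is stored as float64;
# 'pay' mixes integers with pd.NA, so it is stored as object
raw_data = {'income': [10, np.nan, 14, 16], 'pay': [9, 11, 13, pd.NA]}
dat = pd.DataFrame(raw_data, columns=['income', 'pay'])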
dat
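The dtype effect described at the start of this section can be checked with a small sketch. The series names (`s_none`, `s_nan`, `s_nullable`) are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# None and np.nan are both converted to NaN, which forces a float dtype
s_none = pd.Series([1, 2, None])
s_nan = pd.Series([1, 2, np.nan])
print(s_none.dtype, s_nan.dtype)

# With the nullable Int64 dtype the integers stay integers and the gap is pd.NA
s_nullable = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s_nullable.dtype, s_nullable.isna().sum())
```

Requesting the nullable dtype explicitly is a design choice: pandas does not pick it by default when it sees a missing value among integers.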
Get the number of missing values per column
How many total missing values do we have?
Percent of data that is missing
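The three quantities above can be computed along these lines. This is a minimal sketch that rebuilds the same `dat` frame so it runs on its own; the variable names (`missing_per_column`, `total_missing`, `percent_missing`) are chosen here for illustration.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({'income': [10, np.nan, 14, 16],
                    'pay': [9, 11, 13, pd.NA]})

# Number of missing values in each column
missing_per_column = dat.isna().sum()
print(missing_per_column)

# Total number of missing values in the whole frame
total_missing = missing_per_column.sum()
print(total_missing)

# Share of all cells that are missing, in percent
percent_missing = 100 * total_missing / dat.size
print(percent_missing)
```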
Drop missing
Remove all the rows that contain a missing value
dat.dropna()
dat.notnull()  # boolean mask: True where a value is present
dat.isnull()   # boolean mask: True where a value is missing (NaN, None, or pd.NA)
dat.notna()    # alias of notnull()
Remove all columns with at least one missing value
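Dropping columns rather than rows can be done with `dropna(axis=1)`. In the sketch below, the complete `id` column is a hypothetical addition (it is not in the original data) so that something is left after the drop; with only `income` and `pay`, both columns contain a missing value and the result would have no columns.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({'id': [1, 2, 3, 4],            # hypothetical complete column
                    'income': [10, np.nan, 14, 16],
                    'pay': [9, 11, 13, pd.NA]})

# axis=1 drops every column that has at least one missing value
no_missing_cols = dat.dropna(axis=1)
print(no_missing_cols)
```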
Filling missing
Fill each missing value with the mean of the non-missing values in the same column.
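One way to do a mean fill is to pass the column means to `fillna`. This sketch assumes both columns are stored as floats (np.nan in place of pd.NA), so that a mean is defined for each; the name `filled_mean` is illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: 'pay' uses np.nan here so that it is a float column
dat = pd.DataFrame({'income': [10, np.nan, 14, 16],
                    'pay': [9, 11, 13, np.nan]})

# dat.mean() skips NaN by default; fillna matches the means to columns by name
filled_mean = dat.fillna(dat.mean())
print(filled_mean)
```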
Replace all missing values with -9
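A constant fill such as this can be written with `fillna`. As a simplifying assumption, the sketch uses np.nan in both columns so they are plain float columns; `filled_const` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Assumption: both columns are float columns with np.nan as the missing marker
dat = pd.DataFrame({'income': [10, np.nan, 14, 16],
                    'pay': [9, 11, 13, np.nan]})

# Every missing cell receives the sentinel value -9
filled_const = dat.fillna(-9)
print(filled_const)
```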
pandas offers several procedures for filling missing values. For example, interpolate() estimates each NaN from the neighbouring values in the same column.
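A minimal sketch of interpolation, assuming float columns (interpolation is not meant for object columns, so np.nan replaces pd.NA here). The default method is linear, which treats the rows as equally spaced; `interpolated` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Assumption: float columns with np.nan as the missing marker
dat = pd.DataFrame({'income': [10, np.nan, 14, 16],
                    'pay': [9, 11, 13, np.nan]})

# Linear interpolation: the gap in 'income' sits halfway between 10 and 14
interpolated = dat.interpolate()
print(interpolated)
```

Other fill procedures in pandas include `ffill()` and `bfill()`, which copy the previous or next valid value instead of estimating one.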
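Sometimes part of the data needs to be marked as missing, for example when a placeholder code stands for "no answer". This can be done with .apply. To test whether a single NumPy float value is NaN, you can use math.isnan.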
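The sketch below shows one way this might look. It assumes, for illustration only, that the value -9 is the code that should count as missing; the sample column and the name `value` are likewise made up.

```python
import math
import numpy as np
import pandas as pd

# Assumption: -9 is a placeholder code that should be treated as missing
dat = pd.DataFrame({'income': [10.0, -9.0, 14.0, 16.0]})

# .apply runs the function on each value and swaps the code for NaN
dat['income'] = dat['income'].apply(lambda x: np.nan if x == -9 else x)
print(dat)

# math.isnan checks a single float value
value = dat.loc[1, 'income']
print(math.isnan(value))
```

For a simple one-to-one substitution like this, `dat['income'].replace(-9, np.nan)` is a vectorized alternative; `.apply` is more useful when the rule for what counts as missing is more involved.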