
Pandas

Introduction

Pandas is built on top of NumPy, and its functions are highly useful for working with datasets. The shorthand for importing this library is:

import pandas as pd

Data-frame

A DataFrame in Pandas is a highly useful format for working with datasets. It is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). The following code demonstrates how to create a DataFrame from a dictionary:

var={"A": [1,2,0], "B": [2,3,4]}
df= pd.DataFrame(data=var,index=['A', 'Z', 'C'])
One can reorder the index,

df1= df.reindex(['A','C','Z'])
print(df1)

The column labels and their order can easily be specified when creating the DataFrame:

raw_data = {'population': [ 1015.0, 1129.0, 333.0,  515.0],'median_income': [ 1.5, 1.8,  1.7,  3.2]}
df=pd.DataFrame(raw_data, columns = ['population', 'median_income'])

In certain situations, it is beneficial to use the data collection time as the index. The following script converts the data to a datetime format and sets it as the index:

df = df.set_index(pd.to_datetime(['2019-04-01','2019-05-04','2019-06-01','2019-07-02']))

Always use to_datetime to ensure the data is stored in a proper datetime format. This allows easy conversion to specific components such as year, weekday, and more.

df1['date'] = pd.to_datetime(['2019-04-01','2019-05-04','2019-06-01'])
df1['date'].dt.weekday
df1['date'].dt.year

To create an empty DataFrame, use the following code:

df1=pd.DataFrame(columns = ['population', 'median_income'])
df2=pd.DataFrame()

Adding a new column

A column can be easily added to a DataFrame:

df0 = pd.DataFrame([38,40,25,33])
df['Ave_hour'] = df0[0].values  # .values avoids NaN caused by index alignment

The assign() method can also be used to add new columns to a DataFrame. Additionally, new columns can be created using functions, as shown below:

df = df.assign(Ave_hour=df0[0].values)
df = df.assign(PI1=lambda x: x['population']*x['median_income'], PI2=df['population']/df['median_income'])

A column can be renamed using the rename() method, as shown below:

df.columns=['population1','median_income','Ave_hour','PI1','PI2']
df = df.rename(columns={'population1': 'pop', 'median_income': 'med_income'})

Apply function on row or column

Using df.apply(fun) allows you to apply a function to either columns or rows of a DataFrame. Here's an example:

import numpy as np
df.apply(np.sum, axis=0)  # column-wise sum
df.apply(np.sum, axis=1)  # row-wise sum

You can even apply a function to specific columns or rows.

df.apply(lambda x: x['pop'] * x['med_income'], axis=1)
df['product_column'] = df.apply(lambda x: x['pop'] * x['med_income'], axis=1)


def majority(x):
    if x > 17:
        return True
    else:
        return False

stud_alcoh['legal_drinker'] = stud_alcoh['age'].apply(majority)
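
The same column can also be created with a lambda instead of a named function (a minimal sketch, using the same hypothetical stud_alcoh data as above):

stud_alcoh['legal_drinker'] = stud_alcoh['age'].apply(lambda age: age > 17)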

Loading CSV data

The function pd.read_csv is used to import data saved in CSV format. The following example demonstrates how to import a CSV file into Python, with the data being the California housing dataset:

source ="https://storage.googleapis.com/mledu-datasets/california_housing_train.csv"
CHT = pd.read_csv(source, sep=",")
CHT.head()

Loading big CSV data

If the data is very large, you may need to split it into chunks and perform computations on each chunk.

chunk_size = 100
chunks = []
for chunk in pd.read_csv('file.csv', chunksize=chunk_size):
    # do the computation on each chunk here
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)

Generating summary

To view the data types, summary statistics, and general information about the variables in a DataFrame, use .dtypes, .describe(), and .info().

# show the type of variables
CHT.dtypes
# generate summary
CHT.describe()
CHT.info()

It is easy to identify and remove duplicates in a DataFrame.

CHT.duplicated()
CHT.drop_duplicates()

To check for duplicates in specific columns, specify their names as well.

CHT.duplicated(['longitude'])
CHT.drop_duplicates(['longitude'], keep='last')
CHT.index.duplicated()

To truncate the display of data, you can adjust the display options using pd.set_option()

pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 4)

Data type

Pandas has different data types: object (a mix of different types of data), int64 (integer numbers), float64 (floating-point numbers), bool (True/False values), datetime64 (date and time values), and category (a finite number of possible values). The following code shows how to set the data types.

raw_data = {'population': [ 1015.0, 1129.0, 333.0,  515.0],'median_income': [ 1.5, 1.8,  1.7,  3.2], 'class_income': ['low', 'low',  'low',  'high'], 'time':['2019-04-01','2019-05-04','2019-06-01','2019-07-02']}
df=pd.DataFrame(raw_data, columns = ['population', 'median_income','class_income','time'])
df.dtypes
from pandas.api.types import CategoricalDtype
df['population']=df['population'].astype('int64')
df['median_income']=df['median_income'].astype('float64')
df['class_income']=df['class_income'].astype(CategoricalDtype(categories=['low', 'high'], ordered=True))
df['time']=df['time'].astype('datetime64[ns]')
df.dtypes

Size of data-frame

The dimension of a DataFrame is 2, which can be accessed using .ndim. The number of rows and columns can be obtained with .shape.

CHT.ndim
CHT.shape
CHT.shape[0]
CHT.shape[1]

Preview

Besides the print function, Pandas can show the first and the last part of the data using .head() and .tail(). By passing a number in the parentheses, you can specify how many rows to show.

CHT.head(10)
CHT.tail(10)
CHT.sort_values(by='housing_median_age', ascending=False).head(3)
CHT.columns

Manipulating data-frame

Subset of Data

To select data, use the name of the variable, or specify the indices via .iloc and .loc (see http://pandas.pydata.org/pandas-docs/version/0.22/indexing.html). .iloc is integer-based and expects integer positions. In contrast, .loc is primarily label-based, and may also be used with a boolean array.

CHT.longitude
CHT['longitude']
CHT.iloc[:,1]
CHT.iloc[:,[1,3,4]]
CHT.iloc[:,np.r_[1,3:6]]

To select rows, you can also use .iloc[row_index, :]; rows can also be selected using logical (boolean) values:

CHT.iloc[2:10]
CHT.iloc[2:10,:]
CHT[CHT.iloc[:,1]<34]

To retrieve rows with a boolean mask, use .loc; .iloc does not accept a boolean Series. Consider median_income in our data: using quantiles, divide it into three categories.

CHT['famlev'] = ''
C1=CHT.median_income<=CHT.median_income.quantile(.3)
C2=CHT.median_income>=CHT.median_income.quantile(.7)
CHT.loc[C1,'famlev']='L'
CHT.loc[~C1&~C2,'famlev']='M'
CHT.loc[C2,'famlev']='H'
Note that & combines the conditions ~C1 and ~C2. Here we used .loc, so columns are specified by label rather than by position.

Note: You can also chain [][] to apply different conditions to the data.

CHT['median_house_value'][CHT['famlev'] == 'M'].mean()

Selecting or searching can also be done using .where(), which evaluates a condition and replaces values where it is False:

CHT_R=CHT[['total_rooms','total_bedrooms']]
CHT_R.where(CHT.total_rooms<1000)
CHT_R.where(CHT.total_rooms<1000,0)
con= CHT_R<1000
CHT_R.where(con, -999)

CHT.idxmin()  # index label of the minimum of each column
CHT.idxmax()  # index label of the maximum of each column

The opposite of .where() is .mask(); replace .where with .mask in the code above and rerun it. To drop rows and columns use .drop. np.where can be used to create a new column:

CHT['size']=np.where(CHT.total_rooms<1000, 'small', 'big')

To find which rows are 'M':

np.where(CHT.loc[:,'famlev'].isin(['M']))

isin lets you select data whose value "is in" a list of values.

CHT.drop([0,5], axis=0)
CHT.drop('longitude',axis=1, inplace=True)

CHT[CHT['size'].str.startswith('b')]
CHT.iloc[: , :-3]
CHT.loc[CHT['size'].isin(['big'])]

To replace values, use df.replace()

CHT['famlev'].replace('L','Low').replace('M','Middle').replace('H','High')

Note: the argument inplace=True applies the change to the original DataFrame. To transpose a DataFrame, run CHT.T. To change the type of a column, apply .astype() to it:

CHT.total_rooms.astype(int)

One can take a random subset of data using

CHT.sample(4)
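
sample() also accepts a fraction of the rows and a seed for reproducibility; a small sketch:

CHT.sample(frac=0.1, random_state=1)  # a random 10% of the rows, reproducible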

To delete a column, use del:

del CHT['famlev']

eval

A string expression over the DataFrame columns can be evaluated with eval:

var={"A": [1,2,0], "B": [2,3,4], "C":['A', 'Z', 'C']}
df= pd.DataFrame(data=var)
pd.eval('df.A + df.B')
df.eval('D = A + B')
df
df.eval('D = A + B', inplace=True)
df

pd.eval and df.eval evaluate the expression string using one of several backends; see the sketch below.
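
For instance, the backend can be chosen with the engine argument (a minimal sketch; 'numexpr' is used by default when that library is installed, otherwise 'python'):

pd.eval('df.A + df.B', engine='python')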

Select row using query

.query() can be used to select rows; the following two snippets are equivalent:

CHT[CHT.total_rooms<1000]

CHT.query('total_rooms<1000')

To use a variable in a pandas query, use @:

total=1000
CHT.query('total_rooms< @total')

Alternatively, an f-string can be used instead of @:

total=1000
CHT.query(f'total_rooms< {total}')

Select based on data types

df.select_dtypes() returns the columns that have a specific type.

df.select_dtypes(include='int')
df.select_dtypes(include='bool')
df.select_dtypes(include=['float64'])

delete row or column

CHT.drop([0,2]) # drop the first and third rows
CHT.drop(columns=['famlev']) # drop the famlev column
CHT.drop(CHT.columns[[0,2]], axis = 1) # drop the first and third columns

Combining columns

Columns can easily be combined to create a new column:

CHT['famsize']=CHT['famlev']+'-'+CHT['size']

repeat

It is possible to repeat the elements of a data-frame:

df1 = pd.DataFrame([[1, 2], [3, 4], ['a', 'b']], columns=["A", "B"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["c", "d"])
ab=pd.DataFrame(np.repeat(df1.values, 2, axis=0), columns=df1.columns) # repeat each row twice, in sequence
ac=df2.iloc[np.tile(np.arange(len(df2)), len(df1))].reset_index(drop = True) # repeat the whole block len(df1) times

list comprehension

Simple operations using list comprehensions can be performed on a data-frame as well.

CHT['size']=['small' if x<100  else 'big'  for x in CHT['total_rooms']]

Summarising

Although .describe() gives a general summary, more specific summaries of the variables (columns) can be extracted, see

CHT.count()
CHT[CHT.iloc[:,1]<34].nunique()

The following table lists the available summary functions.

Function      Description
count         Number of non-null observations
sum           Sum of values
mean          Mean of values
mad           Mean absolute deviation
median        Median of values
min           Minimum
max           Maximum
mode          Mode
abs           Absolute value
prod          Product of values
std           Unbiased standard deviation
var           Unbiased variance
sem           Unbiased standard error of the mean
skew          Unbiased skewness (3rd moment)
kurt          Unbiased kurtosis (4th moment)
quantile      Sample quantile (value at %)
cumsum        Cumulative sum
cumprod       Cumulative product
cummax        Cumulative maximum
cummin        Cumulative minimum
nunique       Number of unique elements
value_counts  Counts of unique values
cov           Covariance between columns
corr          Correlation between columns
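
A few of these applied to the California housing data (a short sketch using columns introduced above):

CHT['population'].mean()
CHT['population'].quantile(0.5)
CHT['median_income'].cumsum()
CHT['famlev'].value_counts()
CHT[['total_rooms','total_bedrooms']].corr()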

groupby

The summaries can be obtained using any grouping variables in the data set:

CHT.groupby(['famlev']).groups.keys()
CHT.groupby(['famlev']).groups['H']
CHT.groupby(['famlev']).first()

CHT.groupby(['famlev']).sum()

CHT.groupby(['famlev'])['median_house_value'].sum()
# better output
CHT.groupby(['famlev'])[['median_house_value']].sum()

The grouping variables become the index; to bring them back as columns, use .reset_index():

CHT.groupby(['famlev'])[['median_house_value']].sum().reset_index()

It is possible to apply even more complex functions; the following script computes, per group, the sum of total_rooms and the ratio of the mean to the variance of population.

def cv(x):
    return np.mean(x) / np.var(x)

aggr = {
    'total_rooms': 'sum',
    'population': lambda x: cv(x)
}
CHT.groupby('famlev').agg(aggr)

The output can be tidied up,

aggr = {
    'total_rooms': ['mean','std']
}
grouped = CHT.groupby('famlev').agg(aggr)
grouped.columns = grouped.columns.droplevel(level=0)
grouped = grouped.rename(columns={"mean": "total_rooms_mean", "std": "total_rooms_std"})
grouped.head()

To apply a transformation to the individual data points in the data-frame, use .transform(); for example, standardizing median_income:

CHT.median_income.transform(lambda x: (x - x.mean())/x.std()) 
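
Combined with groupby, .transform() returns one value per row; a minimal sketch standardizing median_income within each famlev group:

CHT.groupby('famlev')['median_income'].transform(lambda x: (x - x.mean())/x.std())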

pivot_table

Summaries can also be obtained using a pivot table:

pd.pivot_table(CHT, index=['famlev'], aggfunc=['mean'])

Consider two variables in the index of the pivot table; the result is a stacked table, and .unstack() moves the inner index level into the columns, see below.

pv_0=pd.pivot_table(CHT, index=['famlev','size'], aggfunc=['mean'])
pv_0.columns = pv_0.columns.droplevel() # drop a level from a multi-level column index
pv_0.unstack()

pivot_table offers more tools, for example a custom aggregation function on text data, see below:

df = pd.DataFrame([["a","n1","1"], ["a","n2","2"], ["b","n1","3"], ["b","n2","s4"],["b","n2","s4"]], columns=["meta1", "name", "data"])
df.pivot_table(values='data', index='meta1', columns='name', aggfunc=",".join)

Iterating over rows

Unlike NumPy, pandas is column-based; to create a NumPy array just use CHT.to_numpy(). To iterate over rows, one can use .iterrows() and .itertuples(). .iterrows() yields each row as a Series:

for index, row in CHT.iterrows():
    print(row)

The .itertuples() generates tuple of rows.

for row2 in CHT.itertuples():
    print(row2)

Merging

Pandas is very useful for merging datasets. Consider the following datasets, where 'id1' and 'id2' contain the ids of the data.

raw_data = {'id1': range(4),'income': [10,12,14,16]}
dat1 =pd.DataFrame(raw_data, columns = ['id1', 'income'])

raw_data = {'id2': range(6),'pay': [9,11,13,15,17,19]}
dat2 =pd.DataFrame(raw_data, columns = ['id2', 'pay'])

The id variables are not the same; they can be compared using

dat1['id1'].isin(dat2['id2']).value_counts()
dat2['id2'].isin(dat1['id1']).value_counts()

value_counts() produces a list of unique values and how often they occur in the dataset. pd.merge can merge different data-frames; with how='left', the merge keeps the ids of the left dataset, and where there is no match in the right dataset, NaN is added.

result = pd.merge(dat1, dat2, left_on='id1', right_on='id2',how='left')

Conversely, one can keep the ids of the right dataset:

result = pd.merge(dat1, dat2, left_on='id1', right_on='id2',how='right')

Since the ids are not the same, one can also merge on the intersection of the ids:

result = pd.merge(dat1, dat2, left_on='id1', right_on='id2',how='inner')

Merging can also be done based on the union of the ids,

result = pd.merge(dat1, dat2, left_on='id1', right_on='id2',how='outer')

Note: If the id variables have the same name in both datasets, you can use on=id_name instead of left_on= and right_on= (see the sketch after the next example). Note: if you want to identify which dataset each row comes from, add the argument indicator=True; a new column named _merge is then added to the merged data showing its origin.

result = pd.merge(dat1, dat2, left_on='id1', right_on='id2',how='outer', indicator=True)
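
A minimal sketch of the on= shortcut, renaming dat2's key (dat2b is just an illustrative name) so both frames share the column 'id1':

dat2b = dat2.rename(columns={'id2': 'id1'})  # illustrative copy with a matching key name
result = pd.merge(dat1, dat2b, on='id1', how='inner')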

To combine the data-frames side by side (adding columns), use concat with axis=1.

result = pd.concat([dat1, dat2],axis=1)

To stack them on top of each other (adding rows), use axis=0:

result = pd.concat([dat1,dat1], axis=0)

Note: if dat1 and dat2 are Series rather than data-frames, the result will not be a data-frame, so you might use

result.to_frame()

If the data-frames have the same columns, they can be combined using df1.append(df2), which is deprecated in recent pandas in favour of pd.concat; see the sketch below.
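
A small sketch of the concat equivalent, with two frames that share the same columns:

df_a = pd.DataFrame({'id1': [0, 1], 'income': [10, 12]})
df_b = pd.DataFrame({'id1': [2, 3], 'income': [14, 16]})
pd.concat([df_a, df_b], ignore_index=True)  # stacks the rows; equivalent to df_a.append(df_b)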

Melt

To change from wide to long, we can use pd.melt(df, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True).

var={"A": [1,2,0], "B": [2,3,4], "C":['A', 'Z', 'C']}
df= pd.DataFrame(data=var)
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
pd.melt(df, id_vars=['A'], var_name="id",value_vars=['B', 'C'],value_name='A_or_B')
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'],var_name="test",value_name='value2',ignore_index=False) # keep original index
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'],var_name="test",value_name='value2',ignore_index=True) # generate new index

If the columns are a MultiIndex, use col_level=0 or 1 to decide which level to melt, as in the sketch below.
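
A minimal sketch, building a small frame with two column levels and melting only the first level (df_mi is an illustrative name):

cols = pd.MultiIndex.from_arrays([['A', 'B'], ['x', 'y']])
df_mi = pd.DataFrame([[1, 3], [2, 4]], columns=cols)
pd.melt(df_mi, col_level=0)  # use only the first level as variable names
pd.melt(df_mi, id_vars=['A'], value_vars=['B'], col_level=0)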

Creating bins

To create bins from the data values, one can use pd.cut; it creates a categorical variable from a continuous variable.

var={"A": [1,2,0,7,9,4,3,1]}
df= pd.DataFrame(data=var)
pd.cut(df.A,3)  # create three bins
pd.cut(df.A,3,labels=["low",'middle', "high"]) # add label
pd.cut(df.A,3,labels=["low",'middle', "high"], ordered=False) # drop order
pd.cut(df.A,3, labels=False) # drop label
pd.cut(df.A,[0, .25, .5, .75, 1.]) # add the bins manually

To generate bins according to the quantiles, use pd.qcut():

pd.qcut(df.A,[0, .25, .5, 1.])

interval_range

The function interval_range can be used to create data in interval (range) format; such data can also be used as an index.

pd.interval_range(start=0, end=10, freq=3)
pd.interval_range(start=0, end=10, freq=3, closed='left')
pd.interval_range(start=pd.Timestamp('2021-01-01'),end=pd.Timestamp('2021-01-30'))
pd.interval_range(start=pd.Timestamp('2021-01-01'),periods=4)
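
The resulting intervals can serve directly as a DataFrame index (a small sketch):

bins = pd.interval_range(start=0, end=9, freq=3)  # three intervals: (0,3], (3,6], (6,9]
df_iv = pd.DataFrame({'count': [5, 7, 3]}, index=bins)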

Dummy

pd.get_dummies generates dummy (indicator) variables from a categorical variable:

var={"A": [1,2,0], "B": [2,3,4], "C":['A', 'Z', 'C']}
df= pd.DataFrame(data=var)
pd.get_dummies(df)
pd.get_dummies(df.C)

Factorize

The function pd.factorize() is useful to get the distinct values of an array and to represent the array numerically; it encodes the object as an enumerated type or categorical variable.

var=pd.DataFrame({"A": [1,2,0], "B": [2,3,4], "C":['A', 'Z', 'C']})
lab, uniques = pd.factorize(var.C)
lab
uniques

Pandas has the function pd.Categorical() that can be used to represent categorical data; the following script shows how pd.factorize can be used on categorical data.

var = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
lab, uniques = pd.factorize(var)
lab
uniques

Crosstab

Consider housing_median_age and total_rooms; group each of them using its quantiles, then cross-tabulate them:

CHT['houlev'] = ''
C1=CHT.housing_median_age<=CHT.housing_median_age.quantile(.3)
C2=CHT.housing_median_age>=CHT.housing_median_age.quantile(.7)
CHT.loc[C1,'houlev']='L'
CHT.loc[~C1&~C2,'houlev']='M'
CHT.loc[C2,'houlev']='H'

CHT['roomlev'] = ''
C1=CHT.total_rooms<=CHT.total_rooms.quantile(.3)
C2=CHT.total_rooms>=CHT.total_rooms.quantile(.7)
CHT.loc[C1,'roomlev']='L'
CHT.loc[~C1&~C2,'roomlev']='M'
CHT.loc[C2,'roomlev']='H'

pd.crosstab(CHT.roomlev, CHT.houlev, margins=True)

Now use famlev as a third variable and compute their cross tabulation:

pd.crosstab([CHT.roomlev, CHT.houlev], CHT.famlev, margins=True)
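
The counts can also be normalized, for example per row, using the normalize argument:

pd.crosstab(CHT.roomlev, CHT.houlev, normalize='index')  # row proportions instead of counts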

Accessors

Accessors provide built-in methods that come in handy for type-specific operations; the list is:

>>> pd.Series._accessors
{'sparse', 'str', 'cat', 'dt'}
The sparse, cat, and dt accessors handle sparse data, categorical data, and date formats, respectively; short sketches of cat and dt follow. The str accessor is very handy for strings, see the next section.
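
A minimal sketch of the cat and dt accessors:

s_cat = pd.Series(['low', 'high', 'low'], dtype='category')
s_cat.cat.categories   # the distinct categories
s_cat.cat.codes        # integer code of each value

s_dt = pd.Series(pd.to_datetime(['2019-04-01', '2019-05-04']))
s_dt.dt.month
s_dt.dt.day_name()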

String accessor

Pandas has useful functions for working with text data, and the string methods also accept regular expressions (the re module); a complete description is at https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html. A few examples are given below.

var=pd.DataFrame({"Name":["Sa", "La", "Ry"],"Family":["Am", "Al", "Am_AL"], "Date":["2019", "2020","2021"]})
var.Family.str.count('Am') # Count occurrences of 'Am'
var.Family.str.match('Am') # show where it happens (True/False)
var.Family.isin(['Am'])   
var.Family.str.len()       # Compute the length
var.Family.str.isdigit()  # Check whether the string is numeric
var.Family.str.lower()    # Convert to lower case
var.Family.str.upper()    # Convert to  upper case
var.Family.str.replace('Al','Am') # Replace 'Al' with 'Am'
var.Family.str.split("_", expand=True) # Split where "_" observed
var.Family.str.cat(sep=",")            # Concatenate  
var.Family.str.cat(var.Name, join="left",sep=",")   # Concatenate  with another variable
var.columns=var.columns.str.lower() # Convert the column's names to lower case 

Normalize/Standardize

Sometimes we need to scale data; there are different methods.

Z-score

To normalize, we can use zscore from scipy:

import pandas as pd
var={"A": [1,2,3,4], "B": [4,3,2,1]}
df= pd.DataFrame(data=var,index=(range(4)))
from scipy import stats
stats.zscore(df)
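
The same standardization can be written in plain pandas (note that scipy's zscore uses the population standard deviation by default, while pandas' std uses ddof=1, so the results differ slightly):

(df - df.mean()) / df.std()  # column-wise z-scores with the sample standard deviation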

Box-Cox Transformation

To run a Box-Cox transformation on the data, you can use boxcox from scipy; the code below runs it on a specific column.

df['A']=stats.boxcox(df.A)[0]

To run it on all columns, run the code below.

for col in df.columns[0:]:
  df[col]=stats.boxcox(df[col])[0]

Min-Max scaling

An alternative to z-score normalization is min-max scaling, which scales the data to the range 0-1; minmax_scaling from mlxtend achieves this.

from mlxtend.preprocessing import minmax_scaling
var={"A": [1,2,3,4], "B": [4,3,2,1]}
df= pd.DataFrame(data=var,index=(range(4)))
minmax_scaling(df, columns=["A","B"])

This function can be applied to NumPy arrays as well:

import numpy as np
array = np.array([[1, 4], [2, 3], [3, 2], [4, 1]])
minmax_scaling(array, columns=[0, 1])

Pipeline

.pipe() allows you to build a sequence of functions that run in order on a data-frame.

def categ(x, col):
    x['lev'] = ''
    C1 = x[col] <= x[col].quantile(.3)
    C2 = x[col] >= x[col].quantile(.7)
    x.loc[C1, 'lev'] = 'L'
    x.loc[~C1 & ~C2, 'lev'] = 'M'
    x.loc[C2, 'lev'] = 'H'
    return x

def cv(x):
    return np.mean(x) / np.var(x)

CHT.pipe(categ, col='median_income').pipe(cv)

Convert to Numpy

A data-frame can easily be converted to a NumPy array:

CHT.values
CHT.to_numpy()

save

One of the best formats for saving a dataset is CSV; the following script saves the data as CSV without adding the row index (index=False).

CHT.to_csv("CHT.csv", index=False, encoding='utf8')
CHT.to_excel("CHT.xlsx")

Good References

http://jonathansoma.com/lede/foundations/
https://github.com/TomAugspurger/effective-pandas
https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
https://chrisalbon.com/python/data_wrangling/pandas_crosstabs/
https://github.com/guipsamora/pandas_exercises