Data Analysis using Python¶

04-Summarizing¶

Objectives:

Employing function on dataframe
Running grouping on dataframe
Working with Missing Value
Generating Crosstab

Contents:

Grouping
Missing Value
Crosstab

grouping¶

Summaries can be obtained using any grouping variables present in the dataset:

In [8]:

  Copied!     
 
import pandas as pd
import numpy as np 
source = "../data/CHD_test.csv"
CHD = pd.read_csv(source, sep=",")
CHD.groupby(['famlev']).groups.keys()
CHD.groupby(['famlev']).groups['H']
CHD.groupby(['famlev']).first()

CHD.groupby(['famlev']).sum()

CHD.groupby(['famlev'])['median_house_value'].sum()
# better output
CHD.groupby(['famlev'])[['median_house_value']].sum()
import pandas as pd import numpy as np source = "../data/CHD_test.csv" CHD = pd.read_csv(source, sep=",") CHD.groupby(['famlev']).groups.keys() CHD.groupby(['famlev']).groups['H'] CHD.groupby(['famlev']).first() CHD.groupby(['famlev']).sum() CHD.groupby(['famlev'])['median_house_value'].sum() # better output CHD.groupby(['famlev'])[['median_house_value']].sum()

Out[8]:

	median_house_value
famlev
H	705935355
L	340441311
M	617891143

The grouped variables are assigned as indices, and to revert them back to regular columns, you can use df.reset_index().

In [4]:

  Copied!     
 
CHD.reset_index()
CHD.reset_index()

Out[4]:

	index	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	famlev
0	0	34.19	15	5612	1283	1015	472	1.4936	66900	L
1	1	34.40	19	7650	1901	1129	463	1.8200	80100	L
2	2	33.69	17	720	174	333	117	1.6509	85700	L
3	3	33.64	14	1501	337	515	226	3.1917	73400	M
4	4	33.57	20	1454	326	624	262	1.9250	65500	L
...	...	...	...	...	...	...	...	...	...	...
7995	7995	33.92	29	1436	401	674	343	3.6389	275000	M
7996	7996	33.92	22	2340	584	1141	554	4.5729	337500	H
7997	7997	33.90	39	2311	404	1044	380	8.4680	472100	H
7998	7998	33.90	39	2040	336	926	351	7.5552	500001	H
7999	7999	33.89	38	4166	828	1600	770	6.3861	500001	H

8000 rows × 10 columns

It is indeed possible to apply even complex functions to your data in pandas. For instance, the following script calculates the coefficient of data:

In [6]:

  Copied!     
 
CHD
CHD

Out[6]:

	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	famlev
0	34.19	15	5612	1283	1015	472	1.4936	66900	L
1	34.40	19	7650	1901	1129	463	1.8200	80100	L
2	33.69	17	720	174	333	117	1.6509	85700	L
3	33.64	14	1501	337	515	226	3.1917	73400	M
4	33.57	20	1454	326	624	262	1.9250	65500	L
...	...	...	...	...	...	...	...	...	...
7995	33.92	29	1436	401	674	343	3.6389	275000	M
7996	33.92	22	2340	584	1141	554	4.5729	337500	H
7997	33.90	39	2311	404	1044	380	8.4680	472100	H
7998	33.90	39	2040	336	926	351	7.5552	500001	H
7999	33.89	38	4166	828	1600	770	6.3861	500001	H

8000 rows × 9 columns

In [10]:

  Copied!     
 
def cv(x):
 return (np.mean(x)/np.var(x))

aggr = {
    'total_rooms': 'sum',
    'population': lambda x: cv(x)
}

CHD.groupby('famlev').agg(aggr)
def cv(x): return (np.mean(x)/np.var(x)) aggr = { 'total_rooms': 'sum', 'population': lambda x: cv(x) } CHD.groupby('famlev').agg(aggr)

Out[10]:

	total_rooms	population
famlev
H	7732976	0.000838
L	5035074	0.001019
M	8558179	0.001310

To tidy up the output and make it more presentable, you can format the result,

In [11]:

  Copied!     
 
aggr = {
    'total_rooms': ['mean', 'std']
}
grouped = CHD.groupby('famlev').agg(aggr)
grouped.columns = grouped.columns.droplevel(level=0)
grouped.rename(columns={"mean": "total_rooms", "std": "total_rooms"})
grouped.head()
aggr = { 'total_rooms': ['mean', 'std'] } grouped = CHD.groupby('famlev').agg(aggr) grouped.columns = grouped.columns.droplevel(level=0) grouped.rename(columns={"mean": "total_rooms", "std": "total_rooms"}) grouped.head()

Out[11]:

	mean	std
famlev
H	3222.073333	2896.512006
L	2097.947500	1636.991364
M	2674.430937	2069.132235

Summarizing data using pivot tables is indeed a powerful way to organize and analyze data in pandas. Pivot tables allow you to aggregate and reshape your data, making it easier to derive insights from complex datasets. You can use the pivot_table() function in pandas to create pivot tables, and it provides flexibility in specifying rows, columns, and aggregation functions to suit your analysis needs.

Here's a simple example of creating a pivot table:

In [12]:

  Copied!     
 
pd.pivot_table(CHD, index=['famlev'], aggfunc=['mean'])
pd.pivot_table(CHD, index=['famlev'], aggfunc=['mean'])

Out[12]:

	mean
	households	housing_median_age	latitude	median_house_value	median_income	population	total_bedrooms	total_rooms
famlev
H	503.607500	25.720417	33.738283	294139.731250	6.035106	1465.174583	531.882500	3222.073333
L	511.647500	30.937083	33.785250	141850.546250	2.085039	1591.265833	564.370833	2097.947500
M	535.627813	29.145312	33.799312	193090.982187	3.552313	1539.057188	577.704062	2674.430937

In this example, a pivot table is created to summarize the whole column based on the 'famlev' column, aggregating using the mean function. Pivot tables offer great flexibility for summarizing and analyzing data in various ways, making them a valuable tool in data analysis.

Missing Value¶

Handling missing values in a dataset is a crucial part of data preprocessing. You can use the .fillna(new_value) method to replace missing values with a specified new value. Here's a script that generates a dataset with missing values and fills them with the mean of the other values:

In [21]:

  Copied!     
 
raw_data = {'income': [10, np.nan, 14, 16],
               'pay': [9, 11, 13, np.nan]}
dat = pd.DataFrame(raw_data, columns=['income', 'pay'])
dat2=dat.copy()
dat.income.fillna(dat.income.mean(),inplace=True)
dat.fillna(dat.mean(),inplace=True)
dat
raw_data = {'income': [10, np.nan, 14, 16], 'pay': [9, 11, 13, np.nan]} dat = pd.DataFrame(raw_data, columns=['income', 'pay']) dat2=dat.copy() dat.income.fillna(dat.income.mean(),inplace=True) dat.fillna(dat.mean(),inplace=True) dat

Out[21]:

	income	pay
0	10.000000	9.0
1	13.333333	11.0
2	14.000000	13.0
3	16.000000	11.0

In this script, np.nan is used to represent missing values. The mean of each column is calculated using the .mean() method, and then .fillna() is applied to replace missing values with their respective column means. The inplace=True argument ensures that the changes are applied to the original DataFrame 'dat'.

Indeed, you can use the .isnull() and .notnull() methods to identify which values in a DataFrame or Series are null (missing) or not null (non-missing). Here's how you can use these methods:

In [20]:

  Copied!     
 
null_dat=dat.isnull()
non_null_dat=dat.notnull()
null_dat=dat.isnull() non_null_dat=dat.notnull()

In this example, null_dat will be a DataFrame with True where there are null values in 'df' and False where there are non-null values. Conversely, non_null_dat will be True where there are non-null values and False where there are null values in 'df'. These boolean masks can be used for various purposes, such as filtering or imputing missing data.

pandas provides various methods for filling missing values, and one of those methods is interpolation. Interpolation is useful when you want to estimate missing values based on the values of neighboring data points. Here's an example of how to interpolate NaN values in a DataFrame:

In [22]:

  Copied!     
 
dat2.interpolate()
dat2.interpolate() 

Out[22]:

	income	pay
0	10.0	9.0
1	12.0	11.0
2	14.0	13.0
3	16.0	13.0

In this code, the .interpolate() method is applied to the DataFrame with the default linear interpolation method. You can specify different interpolation methods using the method parameter, such as 'linear', 'polynomial', 'spline', etc., depending on your specific needs. Interpolation is a valuable technique for filling missing data when there is a meaningful relationship between the missing values and the surrounding data points.

Crosstab¶

To create a cross-tabulation (crosstab) between two continuous variables that have been categorized based on quantiles and labeled as 'L' (Low), 'M' (Medium), and 'H' (High), you can use the pd.crosstab function. Here's a step-by-step example:

In [31]:

  Copied!     
 
categories = ['L', 'M', 'H']

q30tr=CHD.total_rooms.quantile(.3)
q70tr=CHD.total_rooms.quantile(.7)

CHD['roomlev'] = pd.cut(CHD['median_income'], bins=[-np.inf, q30mi, q70mi, np.inf], labels=categories)

q30hma=CHD.housing_median_age.quantile(.3)
q70hma=CHD.housing_median_age.quantile(.7)

CHD['houslev'] = pd.cut(CHD['housing_median_age'], bins=[-np.inf, q30hma, q70hma, np.inf], labels=categories)
CHD
categories = ['L', 'M', 'H'] q30tr=CHD.total_rooms.quantile(.3) q70tr=CHD.total_rooms.quantile(.7) CHD['roomlev'] = pd.cut(CHD['median_income'], bins=[-np.inf, q30mi, q70mi, np.inf], labels=categories) q30hma=CHD.housing_median_age.quantile(.3) q70hma=CHD.housing_median_age.quantile(.7) CHD['houslev'] = pd.cut(CHD['housing_median_age'], bins=[-np.inf, q30hma, q70hma, np.inf], labels=categories) CHD

Out[31]:

	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	famlev	houlev	roomlev	houslev
0	34.19	15	5612	1283	1015	472	1.4936	66900	L	L	L	L
1	34.40	19	7650	1901	1129	463	1.8200	80100	L	L	L	L
2	33.69	17	720	174	333	117	1.6509	85700	L	L	L	L
3	33.64	14	1501	337	515	226	3.1917	73400	M	M	M	L
4	33.57	20	1454	326	624	262	1.9250	65500	L	L	L	L
...	...	...	...	...	...	...	...	...	...	...	...	...
7995	33.92	29	1436	401	674	343	3.6389	275000	M	M	M	M
7996	33.92	22	2340	584	1141	554	4.5729	337500	H	H	H	M
7997	33.90	39	2311	404	1044	380	8.4680	472100	H	H	H	H
7998	33.90	39	2040	336	926	351	7.5552	500001	H	H	H	H
7999	33.89	38	4166	828	1600	770	6.3861	500001	H	H	H	H

8000 rows × 12 columns

In [32]:

  Copied!     
 
pd.crosstab(CHD.roomlev, CHD.houlev, margins=True)
pd.crosstab(CHD.roomlev, CHD.houlev, margins=True)

Out[32]:

houlev	L	M	H	All
roomlev
L	2400	0	0	2400
M	0	3200	0	3200
H	0	0	2400	2400
All	2400	3200	2400	8000

To generate a cross-tabulation (crosstab) that includes a third variable, 'famlev', along with the previously categorized variables ('housing_median_age' and 'total_rooms'), you can use the pd.crosstab function with all three variables. Here's an example:

In [33]:

  Copied!     
 
pd.crosstab([CHD.roomlev, CHD.houlev], CHD.famlev, margins=True)
pd.crosstab([CHD.roomlev, CHD.houlev], CHD.famlev, margins=True)

Out[33]:

	famlev	H	L	M	All
roomlev	houlev
L	L	0	2400	0	2400
M	M	0	0	3200	3200
H	H	2400	0	0	2400
All		2400	2400	3200	8000

Contents | Previous(3) manipulating-data-frame | Next (5) Merging data frame ¶

Exercise 04-Summarizing