grouping¶
Summaries can be obtained using any grouping variables present in the dataset:
import pandas as pd
import numpy as np
source = "../data/CHD_test.csv"
CHD = pd.read_csv(source, sep=",")
CHD.groupby(['famlev']).groups.keys()
CHD.groupby(['famlev']).groups['H']
CHD.groupby(['famlev']).first()
CHD.groupby(['famlev']).sum()
CHD.groupby(['famlev'])['median_house_value'].sum()
# better output
CHD.groupby(['famlev'])[['median_house_value']].sum()
| median_house_value | |
|---|---|
| famlev | |
| H | 705935355 |
| L | 340441311 |
| M | 617891143 |
The grouped variables are assigned as indices, and to revert them back to regular columns, you can use df.reset_index().
CHD.reset_index()
| index | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | famlev | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | L |
| 1 | 1 | 34.40 | 19 | 7650 | 1901 | 1129 | 463 | 1.8200 | 80100 | L |
| 2 | 2 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | L |
| 3 | 3 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | M |
| 4 | 4 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.9250 | 65500 | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | 7995 | 33.92 | 29 | 1436 | 401 | 674 | 343 | 3.6389 | 275000 | M |
| 7996 | 7996 | 33.92 | 22 | 2340 | 584 | 1141 | 554 | 4.5729 | 337500 | H |
| 7997 | 7997 | 33.90 | 39 | 2311 | 404 | 1044 | 380 | 8.4680 | 472100 | H |
| 7998 | 7998 | 33.90 | 39 | 2040 | 336 | 926 | 351 | 7.5552 | 500001 | H |
| 7999 | 7999 | 33.89 | 38 | 4166 | 828 | 1600 | 770 | 6.3861 | 500001 | H |
8000 rows × 10 columns
It is indeed possible to apply even complex functions to your data in pandas. For instance, the following script calculates the coefficient of data:
CHD
| latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | famlev | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | L |
| 1 | 34.40 | 19 | 7650 | 1901 | 1129 | 463 | 1.8200 | 80100 | L |
| 2 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | L |
| 3 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | M |
| 4 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.9250 | 65500 | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | 33.92 | 29 | 1436 | 401 | 674 | 343 | 3.6389 | 275000 | M |
| 7996 | 33.92 | 22 | 2340 | 584 | 1141 | 554 | 4.5729 | 337500 | H |
| 7997 | 33.90 | 39 | 2311 | 404 | 1044 | 380 | 8.4680 | 472100 | H |
| 7998 | 33.90 | 39 | 2040 | 336 | 926 | 351 | 7.5552 | 500001 | H |
| 7999 | 33.89 | 38 | 4166 | 828 | 1600 | 770 | 6.3861 | 500001 | H |
8000 rows × 9 columns
def cv(x):
return (np.mean(x)/np.var(x))
aggr = {
'total_rooms': 'sum',
'population': lambda x: cv(x)
}
CHD.groupby('famlev').agg(aggr)
| total_rooms | population | |
|---|---|---|
| famlev | ||
| H | 7732976 | 0.000838 |
| L | 5035074 | 0.001019 |
| M | 8558179 | 0.001310 |
To tidy up the output and make it more presentable, you can format the result,
aggr = {
'total_rooms': ['mean', 'std']
}
grouped = CHD.groupby('famlev').agg(aggr)
grouped.columns = grouped.columns.droplevel(level=0)
grouped.rename(columns={"mean": "total_rooms", "std": "total_rooms"})
grouped.head()
| mean | std | |
|---|---|---|
| famlev | ||
| H | 3222.073333 | 2896.512006 |
| L | 2097.947500 | 1636.991364 |
| M | 2674.430937 | 2069.132235 |
Summarizing data using pivot tables is indeed a powerful way to organize and analyze data in pandas. Pivot tables allow you to aggregate and reshape your data, making it easier to derive insights from complex datasets. You can use the pivot_table() function in pandas to create pivot tables, and it provides flexibility in specifying rows, columns, and aggregation functions to suit your analysis needs.
Here's a simple example of creating a pivot table:
pd.pivot_table(CHD, index=['famlev'], aggfunc=['mean'])
| mean | ||||||||
|---|---|---|---|---|---|---|---|---|
| households | housing_median_age | latitude | median_house_value | median_income | population | total_bedrooms | total_rooms | |
| famlev | ||||||||
| H | 503.607500 | 25.720417 | 33.738283 | 294139.731250 | 6.035106 | 1465.174583 | 531.882500 | 3222.073333 |
| L | 511.647500 | 30.937083 | 33.785250 | 141850.546250 | 2.085039 | 1591.265833 | 564.370833 | 2097.947500 |
| M | 535.627813 | 29.145312 | 33.799312 | 193090.982187 | 3.552313 | 1539.057188 | 577.704062 | 2674.430937 |
In this example, a pivot table is created to summarize the whole column based on the 'famlev' column, aggregating using the mean function. Pivot tables offer great flexibility for summarizing and analyzing data in various ways, making them a valuable tool in data analysis.
Missing Value¶
Handling missing values in a dataset is a crucial part of data preprocessing. You can use the .fillna(new_value) method to replace missing values with a specified new value. Here's a script that generates a dataset with missing values and fills them with the mean of the other values:
raw_data = {'income': [10, np.nan, 14, 16],
'pay': [9, 11, 13, np.nan]}
dat = pd.DataFrame(raw_data, columns=['income', 'pay'])
dat2=dat.copy()
dat.income.fillna(dat.income.mean(),inplace=True)
dat.fillna(dat.mean(),inplace=True)
dat
| income | pay | |
|---|---|---|
| 0 | 10.000000 | 9.0 |
| 1 | 13.333333 | 11.0 |
| 2 | 14.000000 | 13.0 |
| 3 | 16.000000 | 11.0 |
In this script, np.nan is used to represent missing values. The mean of each column is calculated using the .mean() method, and then .fillna() is applied to replace missing values with their respective column means. The inplace=True argument ensures that the changes are applied to the original DataFrame 'dat'.
Indeed, you can use the .isnull() and .notnull() methods to identify which values in a DataFrame or Series are null (missing) or not null (non-missing). Here's how you can use these methods:
null_dat=dat.isnull()
non_null_dat=dat.notnull()
In this example, null_dat will be a DataFrame with True where there are null values in 'df' and False where there are non-null values. Conversely, non_null_dat will be True where there are non-null values and False where there are null values in 'df'. These boolean masks can be used for various purposes, such as filtering or imputing missing data.
pandas provides various methods for filling missing values, and one of those methods is interpolation. Interpolation is useful when you want to estimate missing values based on the values of neighboring data points. Here's an example of how to interpolate NaN values in a DataFrame:
dat2.interpolate()
| income | pay | |
|---|---|---|
| 0 | 10.0 | 9.0 |
| 1 | 12.0 | 11.0 |
| 2 | 14.0 | 13.0 |
| 3 | 16.0 | 13.0 |
In this code, the .interpolate() method is applied to the DataFrame with the default linear interpolation method. You can specify different interpolation methods using the method parameter, such as 'linear', 'polynomial', 'spline', etc., depending on your specific needs. Interpolation is a valuable technique for filling missing data when there is a meaningful relationship between the missing values and the surrounding data points.
Crosstab¶
To create a cross-tabulation (crosstab) between two continuous variables that have been categorized based on quantiles and labeled as 'L' (Low), 'M' (Medium), and 'H' (High), you can use the pd.crosstab function. Here's a step-by-step example:
categories = ['L', 'M', 'H']
q30tr=CHD.total_rooms.quantile(.3)
q70tr=CHD.total_rooms.quantile(.7)
CHD['roomlev'] = pd.cut(CHD['median_income'], bins=[-np.inf, q30mi, q70mi, np.inf], labels=categories)
q30hma=CHD.housing_median_age.quantile(.3)
q70hma=CHD.housing_median_age.quantile(.7)
CHD['houslev'] = pd.cut(CHD['housing_median_age'], bins=[-np.inf, q30hma, q70hma, np.inf], labels=categories)
CHD
| latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | famlev | houlev | roomlev | houslev | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | L | L | L | L |
| 1 | 34.40 | 19 | 7650 | 1901 | 1129 | 463 | 1.8200 | 80100 | L | L | L | L |
| 2 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | L | L | L | L |
| 3 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | M | M | M | L |
| 4 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.9250 | 65500 | L | L | L | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | 33.92 | 29 | 1436 | 401 | 674 | 343 | 3.6389 | 275000 | M | M | M | M |
| 7996 | 33.92 | 22 | 2340 | 584 | 1141 | 554 | 4.5729 | 337500 | H | H | H | M |
| 7997 | 33.90 | 39 | 2311 | 404 | 1044 | 380 | 8.4680 | 472100 | H | H | H | H |
| 7998 | 33.90 | 39 | 2040 | 336 | 926 | 351 | 7.5552 | 500001 | H | H | H | H |
| 7999 | 33.89 | 38 | 4166 | 828 | 1600 | 770 | 6.3861 | 500001 | H | H | H | H |
8000 rows × 12 columns
pd.crosstab(CHD.roomlev, CHD.houlev, margins=True)
| houlev | L | M | H | All |
|---|---|---|---|---|
| roomlev | ||||
| L | 2400 | 0 | 0 | 2400 |
| M | 0 | 3200 | 0 | 3200 |
| H | 0 | 0 | 2400 | 2400 |
| All | 2400 | 3200 | 2400 | 8000 |
To generate a cross-tabulation (crosstab) that includes a third variable, 'famlev', along with the previously categorized variables ('housing_median_age' and 'total_rooms'), you can use the pd.crosstab function with all three variables. Here's an example:
pd.crosstab([CHD.roomlev, CHD.houlev], CHD.famlev, margins=True)
| famlev | H | L | M | All | |
|---|---|---|---|---|---|
| roomlev | houlev | ||||
| L | L | 0 | 2400 | 0 | 2400 |
| M | M | 0 | 0 | 3200 | 3200 |
| H | H | 2400 | 0 | 0 | 2400 |
| All | 2400 | 2400 | 3200 | 8000 |