4-1: Categorize the passengers in the Titanic dataset by age, using the 30% and 70% age quantiles as cut points, and save the result as a column named Age_Category.
In [46]:
import pandas as pd
import numpy as np
titanic = pd.read_csv('../data/titanic.csv', sep=",")
quantiles = titanic['Age'].quantile([0.3, 0.7])
categories = pd.cut(titanic['Age'], bins=[0] + quantiles.tolist() + [float('inf')], labels=['0-30%', '30-70%', '70%+'])
titanic['Age_Category'] = categories
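
To make the binning step easier to follow, here is a minimal sketch of the same approach on a small made-up series. The `ages` values and the variable names `q30`, `q70`, and `age_category` are illustrative assumptions, not taken from the Titanic file. It also shows that a missing age stays missing after `pd.cut`, which is consistent with the equal null counts for Age and Age_Category reported in 4-5.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for titanic['Age']; NaN marks a missing value.
ages = pd.Series([2, 10, 18, 22, 25, 30, 35, 40, 50, 60, np.nan], name="Age")

# 30% and 70% quantiles; NaN values are skipped by default.
q30, q70 = ages.quantile([0.3, 0.7])

# Bins are right-inclusive by default: (0, q30], (q30, q70], (q70, inf).
age_category = pd.cut(
    ages,
    bins=[0, q30, q70, float("inf")],
    labels=["0-30%", "30-70%", "70%+"],
)

# dropna=False keeps the missing age visible in the counts.
counts = age_category.value_counts(dropna=False)
print(counts)
```

Because the lowest bin is open at 0, an age of exactly 0 would fall outside every bin; passing `include_lowest=True` to `pd.cut` is one way to cover that case.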
4-2: Which rows have an age between the 30% and 70% quantiles?
In [33]:
titanic.groupby(['Age_Category']).groups['30-70%']
/var/folders/wg/vccqg9g57mx470xx2txm9slr0000gn/T/ipykernel_17118/2829908899.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
titanic.groupby(['Age_Category']).groups['30-70%']
Out[33]:
Index([ 2, 3, 4, 8, 18, 20, 21, 23, 34, 41,
...
870, 872, 874, 880, 881, 883, 884, 886, 889, 890],
dtype='int64', length=288)
4-3: Display the top rows of the Titanic dataset.
In [34]:
titanic.head()
Out[34]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0-30% |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 70%+ |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 30-70% |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 30-70% |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 30-70% |
4-4: Compute the coefficient of variation (CV) of age separately for each Age_Category level.
In [45]:
def cv(x):
    # Note: this computes mean / variance; the conventional CV is std / mean.
    return np.mean(x) / np.var(x)

titanic.groupby('Age_Category')['Age'].agg(cv)
/var/folders/wg/vccqg9g57mx470xx2txm9slr0000gn/T/ipykernel_17118/3335974191.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
titanic.groupby('Age_Category')['Age'].agg(cv)
Out[45]:
Age_Category
0-30%     0.299863
30-70%    1.827375
70%+      0.615166
Name: Age, dtype: float64
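
For comparison, the coefficient of variation is conventionally defined as the standard deviation divided by the mean. The following is a minimal sketch of that definition on made-up data, assuming the sample standard deviation (`ddof=1`). The function name `coefficient_of_variation` and the `toy` frame are illustrative and not part of the notebook.

```python
import pandas as pd

def coefficient_of_variation(x):
    """Return std / mean, using the sample standard deviation (ddof=1)."""
    x = pd.Series(x).dropna()  # ignore missing values
    return x.std(ddof=1) / x.mean()

# Hypothetical toy data, not the Titanic values.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [10.0, 20.0, 30.0, 5.0, 5.0, 5.0],
})

# One CV per group: group "a" varies, group "b" is constant.
cv_by_group = toy.groupby("group")["value"].agg(coefficient_of_variation)
print(cv_by_group)
```

Applied to the notebook, the same function could be passed to `titanic.groupby('Age_Category')['Age'].agg(...)`; the resulting values would differ from the output shown above.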
4-5: Count the null values in each column of the dataset.
In [36]:
titanic.isnull().sum()
Out[36]:
PassengerId       0
Survived          0
Pclass            0
Name              0
Sex               0
Age             177
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin           687
Embarked          2
Age_Category    177
dtype: int64
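
Raw counts are easier to interpret next to the share of rows they represent. As a small sketch on a made-up frame (the `toy` data and the names `null_counts` and `null_share` are assumptions for illustration), `isnull().mean()` gives the fraction of missing values per column, because the mean of a boolean column is the proportion of `True` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_share = toy.isnull().mean()   # fraction of missing values per column

print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```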
4-6: Fill the null values in the Age column using linear interpolation.
In [44]:
titanic['Age_fill']=titanic['Age'].interpolate()
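
To show what the default linear interpolation does, here is a minimal sketch on a short made-up series (the values are illustrative, not from the dataset). Each gap is filled with evenly spaced values between its neighbouring known ages, based on row position only. With the default settings, a missing value before the first known age is left as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one leading gap and two interior gaps.
ages = pd.Series([np.nan, 22.0, np.nan, 26.0, np.nan, np.nan, 35.0])

# method="linear" is the default; it treats rows as equally spaced.
filled = ages.interpolate()

print(filled.tolist())
```

Since the interpolated value depends only on the neighbouring rows, the result changes if the rows are reordered, which is worth keeping in mind when the row order carries no meaning.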