Data Analysis using Python

04-Exercise

4-1: Categorize the data in the Titanic dataset according to age quantiles at 30% and 70%, and save the column as Age_Category.

In [46]:

import pandas as pd
import numpy as np
titanic = pd.read_csv('../data/titanic.csv', sep=",")
# 30% and 70% quantiles of Age (missing ages are skipped)
quantiles = titanic['Age'].quantile([0.3, 0.7])
# Right-inclusive bins: (0, q30], (q30, q70], (q70, inf); rows with a missing Age stay NaN
categories = pd.cut(titanic['Age'], bins=[0] + quantiles.tolist() + [float('inf')], labels=['0-30%', '30-70%', '70%+'])
titanic['Age_Category'] = categories
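The binning step can be checked on a small example. The following is a minimal sketch on hypothetical toy ages (not the Titanic data); the names `ages` and `age_category` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages (not the Titanic data), including one missing value.
ages = pd.Series([2.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, np.nan], name="Age")

# 30% and 70% quantiles; NaN is skipped when they are computed.
q30, q70 = ages.quantile([0.3, 0.7])

# Right-inclusive bins: (0, q30], (q30, q70], (q70, inf).
age_category = pd.cut(
    ages,
    bins=[0, q30, q70, float("inf")],
    labels=["0-30%", "30-70%", "70%+"],
)

print(age_category)
```

Because the lowest bin starts at 0 and is open on the left, an age of exactly 0 would not be assigned a category; passing `include_lowest=True` to `pd.cut` is one way to cover that case.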

4-2: Which rows have an age between the 30% and 70% quantiles?

In [33]:

titanic.groupby('Age_Category', observed=False).groups['30-70%']

Out[33]:

Index([  2,   3,   4,   8,  18,  20,  21,  23,  34,  41,
       ...
       870, 872, 874, 880, 881, 883, 884, 886, 889, 890],
      dtype='int64', length=288)
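Reading the index from `groupby(...).groups` is one option; a boolean mask on the quantile bounds is another and returns the rows themselves. A minimal sketch, assuming a hypothetical toy frame `df` rather than the Titanic data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Age": [5.0, 25.0, 31.0, 40.0, 48.0, 70.0],
})

q30, q70 = df["Age"].quantile([0.3, 0.7])

# Same convention as the right-inclusive bins: q30 < Age <= q70.
middle = df[(df["Age"] > q30) & (df["Age"] <= q70)]

print(middle)
```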

4-3: Display the top rows of the Titanic dataset.

In [34]:

titanic.head()

Out[34]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Age_Category
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	0-30%
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	70%+
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	30-70%
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	30-70%
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	30-70%

4-4: Compute the coefficient of variation (CV) of Age for each Age_Category level separately.

In [45]:

def cv(x):
    # Coefficient of variation: standard deviation divided by the mean
    return np.std(x) / np.mean(x)

titanic.groupby('Age_Category', observed=False)['Age'].agg(cv)
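The coefficient of variation is conventionally defined as the standard deviation divided by the mean. A minimal sketch of a per-group CV on hypothetical toy data follows; the frame `df` and the helper name `coefficient_of_variation` are illustrative assumptions, and the population standard deviation (`ddof=0`) is used here as one possible choice.

```python
import pandas as pd

# Hypothetical toy data with two groups.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 10.0, 10.0],
})

def coefficient_of_variation(x):
    # Population standard deviation (ddof=0) divided by the mean.
    return x.std(ddof=0) / x.mean()

cv_by_group = df.groupby("group")["value"].agg(coefficient_of_variation)
print(cv_by_group)
```

Group `b` has no spread, so its CV is 0; with `ddof=1` (the pandas default for `std`) the value for group `a` would be slightly larger.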

4-5: How many null values are there in each column of the dataset?

In [36]:

titanic.isnull().sum()

Out[36]:

PassengerId       0
Survived          0
Pclass            0
Name              0
Sex               0
Age             177
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin           687
Embarked          2
Age_Category    177
dtype: int64
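Besides the per-column counts, the share of missing values and the overall total can be useful. A minimal sketch on a hypothetical toy frame (the column values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

null_counts = df.isnull().sum()        # missing values per column
null_share = df.isnull().mean()        # fraction of missing values per column
total_nulls = int(null_counts.sum())   # missing values in the whole frame

print(null_counts)
print(null_share)
print(total_nulls)
```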

4-6: Fill the null values in the Age column using linear interpolation.

In [44]:

titanic['Age_fill'] = titanic['Age'].interpolate(method='linear')
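How linear interpolation treats gaps is easiest to see on a short series. A minimal sketch with hypothetical values; `values` and `filled` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an interior gap and a trailing gap.
values = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Linear interpolation treats the values as equally spaced by position.
filled = values.interpolate(method="linear")

print(filled)
```

With the default settings, interior gaps are filled on a straight line between their neighbours, a trailing gap repeats the last valid value, and a leading gap stays NaN because there is no earlier value to interpolate from.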

Contents | Previous (3) Exercise | Next (5) Exercise
