Consider the Titanic dataset
The available metadata of the Titanic dataset provides the following information:
| VARIABLE | DESCRIPTION |
|---|---|
| PassengerId | Passenger ID |
| Survived | 0 = No; 1 = Yes |
| Pclass | Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) |
| Name | Passenger name |
| Sex | Passenger gender |
| Age | Passenger age |
| SibSp | Number of siblings/spouses aboard |
| Parch | Number of parents/children aboard |
| Ticket | Ticket number |
| Fare | Passenger fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
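The coded columns in the table above can be turned into readable labels with `Series.map`. The following is a minimal sketch on a hypothetical three-row sample (the rows are made up for illustration and are not read from `titanic.csv`); the new column names `SurvivedLabel` and `EmbarkedPort` are assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows using the codes documented in the metadata table
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", "Q"],
})

# Lookup tables taken from the metadata descriptions
survived_labels = {0: "No", 1: "Yes"}
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# map() replaces each code with its label; unmapped codes would become NaN
sample["SurvivedLabel"] = sample["Survived"].map(survived_labels)
sample["EmbarkedPort"] = sample["Embarked"].map(port_names)

print(sample)
```

Keeping the original coded columns alongside the labelled ones leaves the numeric values available for later calculations.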
2-1: Import the titanic data
In [1]:
import pandas as pd
titanic = pd.read_csv('../data/titanic.csv', sep=",")
2-3: Determine the number of records (rows) and columns in the dataset.
In [3]:
# Get the number of records (rows) and columns
num_records, num_columns = titanic.shape
num_records, num_columns  # Last expression in the cell, so the tuple is displayed
Out[3]:
(891, 12)
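As a small illustration of how `shape` relates to other size attributes, here is a sketch on a hypothetical three-row DataFrame (the values are illustrative and not taken from `titanic.csv`).

```python
import pandas as pd

# Hypothetical sample; values are illustrative only
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, None],
})

n_rows, n_cols = sample.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)          # 3 3
print(len(sample))             # len() counts rows only: 3
print(sample.size)             # size is rows * columns: 9
```

Missing values still count toward the shape: the `None` in `Age` is stored as `NaN` and occupies a cell like any other value.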
2-4: Display the top and bottom rows of the dataset.
In [4]:
titanic.head()  # Top rows (not rendered: a cell shows only its last expression)
titanic.tail()  # Bottom rows (rendered below)
Out[4]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
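Only the bottom rows appear in the output above because a notebook cell renders just its final expression. One way to see both ends of a DataFrame from a single cell is to print each result explicitly. This is a sketch on a hypothetical six-row sample, not the real dataset.

```python
import pandas as pd

# Hypothetical sample with six rows; values are illustrative only
sample = pd.DataFrame({
    "PassengerId": range(1, 7),
    "Pclass": [3, 1, 3, 1, 3, 2],
})

top = sample.head(2)     # first two rows
bottom = sample.tail(2)  # last two rows

# print() shows both results, unlike two bare expressions in one cell
print(top)
print(bottom)
```

Both methods default to five rows when called without an argument, which is why the output above shows rows 886 to 890.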
2-5: What are the data types of the columns in the dataset?
In [5]:
titanic.dtypes
Out[5]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
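Once the dtypes are known, columns can be selected by type. The sketch below uses `select_dtypes` on a hypothetical two-row sample (illustrative values, not read from `titanic.csv`) to pick out the numeric columns.

```python
import pandas as pd

# Hypothetical sample mixing integer, float, and text columns
sample = pd.DataFrame({
    "Survived": [0, 1],
    "Age": [22.0, 38.0],
    "Sex": ["male", "female"],
})

print(sample.dtypes)

# Keep only the integer and float columns
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Survived', 'Age']
```

In the Titanic output above, `Age` is `float64` rather than an integer type because the column contains missing values, which pandas stores as `NaN`.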
2-6: What are the column labels or names in the dataset?
In [6]:
titanic.columns
Out[6]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
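The `Index` returned by `columns` can be converted to a plain list or used for membership checks. A minimal sketch, assuming an empty DataFrame that carries only three of the column names (the name `Deck` is hypothetical and used only to show a failed lookup):

```python
import pandas as pd

# Empty DataFrame with a few of the Titanic column names, for illustration
sample = pd.DataFrame(columns=["PassengerId", "Survived", "Pclass"])

names = sample.columns.tolist()      # Index -> plain Python list
print(names)                         # ['PassengerId', 'Survived', 'Pclass']

print("Survived" in sample.columns)  # True
print("Deck" in sample.columns)      # False; "Deck" is a made-up name
```

Checking membership before accessing a column avoids a `KeyError` when a file does not contain the expected field.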