Pandas¶
Pandas is one of the most widely used Python libraries for working with datasets. It is built on top of NumPy and offers a rich set of functions for efficient data manipulation. The conventional alias for importing it is:
import pandas as pd
DataFrame¶
Pandas DataFrames are an incredibly useful format for working with datasets. They represent a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes for both rows and columns. The following code demonstrates how to create a DataFrame from a dictionary:
var = {"A": [1, 2, 0], "B": [2, 3, 4]}
df = pd.DataFrame(data=var, index=['A', 'Z', 'C'])
df
|   | A | B |
|---|---|---|
| A | 1 | 2 |
| Z | 2 | 3 |
| C | 0 | 4 |
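With a labeled index in place, rows can be selected by label with loc or by position with iloc. A minimal sketch, reusing the column and index names from the example above:

```python
import pandas as pd

# Recreate the example DataFrame with a labeled index
var = {"A": [1, 2, 0], "B": [2, 3, 4]}
df = pd.DataFrame(data=var, index=['A', 'Z', 'C'])

# Select a single row by its index label
row_z = df.loc['Z']
print(row_z['A'], row_z['B'])  # 2 3

# Select the same kind of row by integer position instead
print(df.iloc[0]['B'])  # 2
```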
Specifying the column labels is a straightforward task:
raw_data = {'population': [1015.0, 1129.0, 333.0, 515.0],
            'median_income': [1.5, 1.8, 1.7, 3.2]}
df = pd.DataFrame(raw_data, columns=['population', 'median_income'])
To create an empty DataFrame, execute the following code:
df1=pd.DataFrame(columns = ['population', 'median_income'])
df2=pd.DataFrame()
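An empty DataFrame like this can later be filled row by row. A hedged sketch using pd.concat (the row values here are made up for illustration):

```python
import pandas as pd

# Empty DataFrame with predefined columns
df1 = pd.DataFrame(columns=['population', 'median_income'])

# Append a row by concatenating a single-row DataFrame
new_row = pd.DataFrame({'population': [1015.0], 'median_income': [1.5]})
df1 = pd.concat([df1, new_row], ignore_index=True)

print(len(df1))                 # 1
print(df1.columns.tolist())     # ['population', 'median_income']
```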
Adding a new column¶
Adding a column to a DataFrame is a straightforward operation.
df0 = pd.Series([38, 40, 25, 33])
df['Ave_hour'] = df0
You can also add new columns to a DataFrame using the assign() method. New columns can be generated using functions, as shown below:
df=df.assign(Ave_hour=df0)
df=df.assign(PI1=lambda x: x['population']*x['median_income'],PI2=df['population']/df['median_income'] )
You can rename columns in a DataFrame using the rename() method. Note that rename(..., inplace=True) returns None, so do not combine inplace=True with assignment.
df.columns = ['population1', 'median_income', 'Ave_hour', 'PI1', 'PI2']
df = df.rename(columns={'population1': 'pop', 'median_income': 'med_income'})
df
In certain situations, it may be more appropriate to use the data collection time as the index. The script below demonstrates how to convert the data to a time format and then set it as the index.
df = df.set_index(pd.to_datetime(['2019-04-01','2019-05-04','2019-06-01','2019-07-02']))
You can use the to_datetime function to ensure the data is stored as datetime values, which can then be converted into time-related attributes such as year, weekday, and more.
df['date'] = pd.to_datetime(['2019-04-01', '2019-05-04', '2019-06-01', '2019-07-02'])
df['date'].dt.weekday
df['date'].dt.year
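Other time-related attributes are exposed through the same dt accessor. A self-contained sketch, using the dates from the example above:

```python
import pandas as pd

dates = pd.to_datetime(['2019-04-01', '2019-05-04', '2019-06-01', '2019-07-02'])
s = pd.Series(dates)

# Extract various date components from the datetime values
print(s.dt.year.tolist())       # [2019, 2019, 2019, 2019]
print(s.dt.month.tolist())      # [4, 5, 6, 7]
print(s.dt.day_name().tolist()) # ['Monday', 'Saturday', 'Saturday', 'Tuesday']
```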
Next, we save the data. CSV is one of the most widely used formats for storing datasets. The following script writes the DataFrame to a CSV file; setting index=False omits the row index (column names are still written unless header=False is passed as well).
df.to_csv("test_df.csv", index=False, encoding='utf8')
df.to_excel("test_df.xlsx")
Loading CSV data¶
The pd.read_csv function is used to import data saved in CSV format. The following example demonstrates how to import a CSV file into Python, using California housing data as an example:
source = "../data/CHD.csv"
CHD = pd.read_csv(source, sep=",")
CHD.head() # Displays the first several rows of CHD
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 |
| 1 | -114.47 | 34.40 | 19 | 7650 | 1901 | 1129 | 463 | 1.8200 | 80100 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.9250 | 65500 |
A DataFrame can store data of different types in its columns. Common data types include:
- object: a mix of different data types (typically strings).
- int64: integer numbers.
- float64: floating-point numbers.
- bool: True/False values.
- datetime64: date and time values.
- category: a finite set of possible values.
You can define the data type of a column in a DataFrame using various methods, including when reading in data or explicitly specifying the dtype argument.
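A sketch of both approaches, using a small inline CSV instead of the housing file:

```python
import io
import pandas as pd

csv_text = "population,median_income\n1015,1.5\n1129,1.8\n"

# Option 1: declare a column's type while reading the data
df = pd.read_csv(io.StringIO(csv_text), dtype={'population': 'float64'})
print(df['population'].dtype)     # float64

# Option 2: convert an existing column explicitly with astype()
df['median_income'] = df['median_income'].astype('float32')
print(df['median_income'].dtype)  # float32
```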
To check the data types of the columns in a DataFrame, you can use the dtypes attribute, as shown in the following code:
CHD.dtypes # show the type of variables
longitude             float64
latitude              float64
housing_median_age      int64
total_rooms             int64
total_bedrooms          int64
population              int64
households              int64
median_income         float64
median_house_value      int64
dtype: object
Generating summary¶
To examine the types, information, and summary statistics of variables in a DataFrame, you can use the following methods:
- df.dtypes: displays the data type of each column in the DataFrame.
- df.describe(): provides summary statistics for numeric columns, including count, mean, standard deviation, minimum, and maximum values.
- df.info(): gives a concise summary of the DataFrame, including the number of non-null values in each column, the data types, and memory usage.
CHD.shape      # Display the dimensions
CHD.columns    # Return the column names
CHD.describe() # Generate summary statistics
CHD.info()
CHD.head()     # Top rows
CHD.tail()     # Bottom rows
To compute various statistics per column in a DataFrame, you can use different aggregation functions depending on your specific requirements. Here's how you can compute some common statistics using Pandas:
print("Count of median_house_value: ", CHD['median_house_value'].count())
print("Mean of median_house_value: ", CHD['median_house_value'].mean())
print("Std of median_house_value: ", CHD['median_house_value'].std())
print("Min of median_house_value: ", CHD['median_house_value'].min())
print("Max of median_house_value: ", CHD['median_house_value'].max())
You can use these functions individually to compute the desired statistics for each column in your DataFrame based on your analysis needs.
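Several statistics can also be computed in a single call with agg(). A minimal sketch, using stand-in values rather than the full housing file:

```python
import pandas as pd

# Stand-in column with the first five median_house_value entries
df = pd.DataFrame({'median_house_value': [66900, 80100, 85700, 73400, 65500]})

# One call computes several aggregations at once
stats = df['median_house_value'].agg(['count', 'mean', 'std', 'min', 'max'])
print(stats['count'])  # 5
print(stats['mean'])   # 74320.0
print(stats['min'])    # 65500
print(stats['max'])    # 85700
```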
Loading xls¶
To read legacy Excel files (.xls format) in Python, pandas relies on the xlrd package, while the newer .xlsx format is handled by the openpyxl library. In either case you can use pd.ExcelFile, as shown below:
xls_book = pd.ExcelFile('../data/titianic.xlsx')
xls_book
dfs_xls_book = {sheet: xls_book.parse(sheet) for sheet in xls_book.sheet_names}
dfs_xls_book['Sheet1']
|   | PassengerId | Name | Sex | Age |
|---|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | male | 22.0 |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 |
| 4 | 5 | Allen, Mr. William Henry | male | 35.0 |
| ... | ... | ... | ... | ... |
| 886 | 887 | Montvila, Rev. Juozas | male | 27.0 |
| 887 | 888 | Graham, Miss. Margaret Edith | female | 19.0 |
| 888 | 889 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN |
| 889 | 890 | Behr, Mr. Karl Howell | male | 26.0 |
| 890 | 891 | Dooley, Mr. Patrick | male | 32.0 |
891 rows × 4 columns
xls_book = pd.ExcelFile('../data/titianic.xls')
xls_book
dfs_xls_book = {sheet: xls_book.parse(sheet) for sheet in xls_book.sheet_names}
dfs_xls_book['Sheet1']
The output is identical to the .xlsx example above.