Pandas¶
Pandas is one of the most widely used Python libraries for working with datasets. It is built on top of NumPy and offers a rich set of functions for efficient data manipulation. The conventional alias for importing it is:
import pandas as pd
DataFrame¶
Pandas DataFrames are an incredibly useful format for working with datasets. They represent a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes for both rows and columns. The following code demonstrates how to create a DataFrame from a dictionary:
var = {"A": [1, 2, 0], "B": [2, 3, 4]}
df = pd.DataFrame(data=var, index=['A', 'Z', 'C'])
df
|   | A | B |
|---|---|---|
| A | 1 | 2 |
| Z | 2 | 3 |
| C | 0 | 4 |
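With a labeled index in place, rows can be selected by label with loc or by position with iloc. A minimal sketch, reusing the column and index names from the example above:

```python
import pandas as pd

# Recreate the example DataFrame with a labeled index
var = {"A": [1, 2, 0], "B": [2, 3, 4]}
df = pd.DataFrame(data=var, index=['A', 'Z', 'C'])

# Select a single row by its index label
row_z = df.loc['Z']
print(row_z['A'], row_z['B'])  # 2 3

# Select the same kind of row by integer position instead
print(df.iloc[0]['B'])  # 2
```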
Specifying the column labels is a straightforward task:
raw_data = {'population': [1015.0, 1129.0, 333.0, 515.0],
            'median_income': [1.5, 1.8, 1.7, 3.2]}
df = pd.DataFrame(raw_data, columns=['population', 'median_income'])
To create an empty DataFrame, execute the following code:
df1=pd.DataFrame(columns = ['population', 'median_income'])
df2=pd.DataFrame()
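An empty DataFrame like this can later be filled row by row. A hedged sketch using pd.concat (the row values here are made up for illustration):

```python
import pandas as pd

# Empty DataFrame with predefined columns
df1 = pd.DataFrame(columns=['population', 'median_income'])

# Append a row by concatenating a single-row DataFrame
new_row = pd.DataFrame({'population': [1015.0], 'median_income': [1.5]})
df1 = pd.concat([df1, new_row], ignore_index=True)

print(len(df1))                 # 1
print(df1.columns.tolist())     # ['population', 'median_income']
```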
Adding a new column¶
Adding a column to a DataFrame is a straightforward operation.
df0 = pd.Series([38, 40, 25, 33])
df['Ave_hour'] = df0
You can also add new columns to a DataFrame using the assign() method. New columns can be generated using functions, as shown below:
df=df.assign(Ave_hour=df0)
df=df.assign(PI1=lambda x: x['population']*x['median_income'],PI2=df['population']/df['median_income'] )
You can rename columns in a DataFrame using the rename() method. Note that rename(..., inplace=True) returns None, so do not combine inplace=True with assignment.
df.columns = ['population1', 'median_income', 'Ave_hour', 'PI1', 'PI2']
df = df.rename(columns={'population1': 'pop', 'median_income': 'med_income'})
df
In certain situations, it may be more appropriate to use the data collection time as the index. The script below demonstrates how to convert the data to a time format and then set it as the index.
df = df.set_index(pd.to_datetime(['2019-04-01','2019-05-04','2019-06-01','2019-07-02']))
You can use the to_datetime function to ensure the data is stored as datetime values, which can then be converted into time-related attributes such as year, weekday, and more.
df['date'] = pd.to_datetime(['2019-04-01', '2019-05-04', '2019-06-01', '2019-07-02'])
df['date'].dt.weekday
df['date'].dt.year
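Other time-related attributes are exposed through the same dt accessor. A self-contained sketch, using the dates from the example above:

```python
import pandas as pd

dates = pd.to_datetime(['2019-04-01', '2019-05-04', '2019-06-01', '2019-07-02'])
s = pd.Series(dates)

# Extract various date components from the datetime values
print(s.dt.year.tolist())       # [2019, 2019, 2019, 2019]
print(s.dt.month.tolist())      # [4, 5, 6, 7]
print(s.dt.day_name().tolist()) # ['Monday', 'Saturday', 'Saturday', 'Tuesday']
```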
Next, we save the data. CSV is one of the most widely used formats for storing datasets. The following script writes the DataFrame to a CSV file; setting index=False omits the row index (column names are still written unless header=False is passed as well).
df.to_csv("test_df.csv", index=False, encoding='utf8')
df.to_excel("test_df.xlsx")
Loading CSV data¶
The pd.read_csv function is used to import data saved in CSV format. The following example demonstrates how to import a CSV file into Python, using California housing data as an example:
source = "../data/CHD.csv"
CHD = pd.read_csv(source, sep=",")
CHD.head() # Displays the first several rows of CHD
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 |
| 1 | -114.47 | 34.40 | 19 | 7650 | 1901 | 1129 | 463 | 1.8200 | 80100 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.9250 | 65500 |
A DataFrame can store data of different types in its columns. Common data types include:
- object: a mix of different data types (typically strings).
- int64: integer numbers.
- float64: floating-point numbers.
- bool: True/False values.
- datetime64: date and time values.
- category: a finite set of possible values.
You can define the data type of a column in a DataFrame using various methods, including when reading in data or explicitly specifying the dtype argument.
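A sketch of both approaches, using a small inline CSV instead of the housing file:

```python
import io
import pandas as pd

csv_text = "population,median_income\n1015,1.5\n1129,1.8\n"

# Option 1: declare a column's type while reading the data
df = pd.read_csv(io.StringIO(csv_text), dtype={'population': 'float64'})
print(df['population'].dtype)     # float64

# Option 2: convert an existing column explicitly with astype()
df['median_income'] = df['median_income'].astype('float32')
print(df['median_income'].dtype)  # float32
```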
To check the data types of the columns in a DataFrame, you can use the dtypes attribute, as shown in the following code:
CHD.dtypes # show the type of variables
longitude             float64
latitude              float64
housing_median_age      int64
total_rooms             int64
total_bedrooms          int64
population              int64
households              int64
median_income         float64
median_house_value      int64
dtype: object
Generating summary¶
To examine the types, information, and summary statistics of variables in a DataFrame, you can use the following methods:
- df.dtypes: displays the data type of each column in the DataFrame.
- df.describe(): provides summary statistics for numeric columns, including count, mean, standard deviation, minimum, and maximum values.
- df.info(): gives a concise summary of the DataFrame, including the number of non-null values in each column, the data types, and memory usage.
CHD.shape      # Display the dimensions
CHD.columns    # Return the column names
CHD.describe() # Generate summary statistics
CHD.info()
CHD.head()     # Top rows
CHD.tail()     # Bottom rows
To compute various statistics per column in a DataFrame, you can use different aggregation functions depending on your specific requirements. Here's how you can compute some common statistics using Pandas:
print("Count of median_house_value: ", CHD['median_house_value'].count())
print("Mean of median_house_value: ", CHD['median_house_value'].mean())
print("Std of median_house_value: ", CHD['median_house_value'].std())
print("Min of median_house_value: ", CHD['median_house_value'].min())
print("Max of median_house_value: ", CHD['median_house_value'].max())
You can use these functions individually to compute the desired statistics for each column in your DataFrame based on your analysis needs.
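Several statistics can also be computed in a single call with agg(). A minimal sketch, using stand-in values rather than the full housing file:

```python
import pandas as pd

# Stand-in column with the first five median_house_value entries
df = pd.DataFrame({'median_house_value': [66900, 80100, 85700, 73400, 65500]})

# One call computes several aggregations at once
stats = df['median_house_value'].agg(['count', 'mean', 'std', 'min', 'max'])
print(stats['count'])  # 5
print(stats['mean'])   # 74320.0
print(stats['min'])    # 65500
print(stats['max'])    # 85700
```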
Loading xls¶
To read legacy Excel files (.xls format) in Python, pandas relies on the xlrd package, while the newer .xlsx format is handled by the openpyxl library. In either case you can use pd.ExcelFile, as shown below:
xls_book = pd.ExcelFile('../data/titianic.xlsx')
xls_book
dfs_xls_book = {sheet: xls_book.parse(sheet) for sheet in xls_book.sheet_names}
dfs_xls_book['Sheet1']
|   | PassengerId | Name | Sex | Age |
|---|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | male | 22.0 |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 |
| 4 | 5 | Allen, Mr. William Henry | male | 35.0 |
| ... | ... | ... | ... | ... |
| 886 | 887 | Montvila, Rev. Juozas | male | 27.0 |
| 887 | 888 | Graham, Miss. Margaret Edith | female | 19.0 |
| 888 | 889 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN |
| 889 | 890 | Behr, Mr. Karl Howell | male | 26.0 |
| 890 | 891 | Dooley, Mr. Patrick | male | 32.0 |
891 rows × 4 columns
xls_book = pd.ExcelFile('../data/titianic.xls')
xls_book
dfs_xls_book = {sheet: xls_book.parse(sheet) for sheet in xls_book.sheet_names}
dfs_xls_book['Sheet1']
The output is identical to the .xlsx example above.