Data Analysis using Python

03-Manipulating data-frame

Objectives

Selecting a subset of a data frame
Labeling the data frame
Reassigning values
Filtering with conditions using ==, !=, >, <, >=, <=

Contents:

Selecting Data
Subsets of Rows and Columns
Generating descriptive statistics

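Selecting part of Data

To select a column, either put the column label in square brackets [] or write it after a period following the DataFrame name. The period form only works when the label is a valid Python identifier that does not clash with an existing DataFrame attribute or method.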

In [ ]:

CHD.median_house_value
CHD['median_house_value']
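As a small illustration of the two access styles, the sketch below uses a made-up three-row frame named toy (a hypothetical stand-in for CHD, not the real data). It suggests why bracket access is the safer default: a column called size, like the one created later in this notebook, is hidden behind the DataFrame's own size attribute when accessed with a period.

```python
import pandas as pd

# Hypothetical miniature stand-in for CHD; the values are made up.
toy = pd.DataFrame({
    "median_house_value": [452600.0, 358500.0, 352100.0],
    "size": ["small", "big", "big"],
})

# Both styles return the same Series for an ordinary column name.
by_bracket = toy["median_house_value"]
by_attribute = toy.median_house_value
same_series = by_bracket.equals(by_attribute)

# 'size' clashes with the built-in DataFrame.size attribute.
attr_size = toy.size     # number of cells in the frame, not the column
col_size = toy["size"]   # the actual column
```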

Selecting a single column from a pandas DataFrame returns a pandas Series, which is a one-dimensional data structure. You can access data either by column name or through the .iloc and .loc indexers. .iloc selects by zero-based integer position. .loc selects primarily by label but also accepts a boolean array.

In [ ]:

CHD.longitude          # by attribute
CHD['longitude']       # by label
CHD.iloc[:, 1]         # all rows, column at position 1 (the second column), as a Series
CHD.iloc[:, [1, 3]]    # all rows, columns at positions 1 and 3, as a DataFrame
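The difference between position-based and label-based selection is easiest to see when the row labels are not 0, 1, 2. The sketch below assumes a hypothetical frame toy with made-up coordinates and row labels 10, 20, 30. It also shows that a single position returns a Series while a list of positions returns a DataFrame.

```python
import pandas as pd

# Hypothetical data with non-default row labels; values are made up.
toy = pd.DataFrame(
    {"longitude": [-122.23, -122.22, -122.24],
     "latitude": [37.88, 37.86, 37.85]},
    index=[10, 20, 30],
)

first_by_position = toy.iloc[0, 1]         # row position 0, column position 1
first_by_label = toy.loc[10, "latitude"]   # row label 10, column label 'latitude'

col_series = toy.iloc[:, 1]     # single position -> Series
col_frame = toy.iloc[:, [1]]    # list of positions -> DataFrame
```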

Subsets of Rows and Columns

To select a subset of rows by position, use .iloc[row_positions, :]. You can also filter rows with a boolean condition.

In [ ]:

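CHD.iloc[2:10]               # rows at positions 2 to 9 (the end of a slice is excluded)
CHD.iloc[2:10, :]            # same rows, with all columns stated explicitly
CHD.iloc[[2, 10], :]         # only the rows at positions 2 and 10
CHD[CHD.iloc[:, 1] < 34]     # rows where the column at position 1 is below 34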
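A slice and a list of positions look similar but select different rows. The sketch below uses a hypothetical twelve-row frame toy with a made-up latitude column to compare them, and adds a label-based slice with .loc for contrast, since .loc includes both endpoints.

```python
import pandas as pd

# Hypothetical data: twelve rows with made-up latitudes 32.0, 33.0, ..., 43.0.
toy = pd.DataFrame({"latitude": [32.0 + i for i in range(12)]})

sliced = toy.iloc[2:10]           # positions 2..9, end excluded
picked = toy.iloc[[2, 10], :]     # exactly positions 2 and 10
by_label = toy.loc[2:10]          # labels 2..10, end included
south = toy[toy["latitude"] < 34] # boolean filter on a condition
```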

To select rows with a boolean condition and assign values to them, use .loc, which accepts a boolean Series as the row selector. For example, to split median_income into three categories using the cut-offs 2.7 and 4.4:

In [ ]:

CHD['famlev'] = ''                       # start with an empty label
C1 = CHD.median_income <= 2.7            # low-income rows
C2 = CHD.median_income >= 4.4            # high-income rows
CHD.loc[C1, 'famlev'] = 'L'
CHD.loc[~C1 & ~C2, 'famlev'] = 'M'       # neither low nor high
CHD.loc[C2, 'famlev'] = 'H'

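This code can be rewritten more compactly with pd.cut. With its default right-closed bins, a value of exactly 4.4 is labeled 'M', whereas the code above labels it 'H', and values at or below 0 become NaN.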

In [ ]:

import numpy as np
import pandas as pd

# Create a new column to store the categories
CHD['famlev2'] = pd.cut(CHD['median_income'], bins=[0, 2.7, 4.4, np.inf], labels=['L', 'M', 'H'])
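The two approaches differ only at the bin edges. The sketch below checks this on a hypothetical Series income with made-up values that sit exactly on the cut-offs. It also uses np.select, which is not part of the original notebook, as one possible way to reproduce the <= 2.7 and >= 4.4 rules in a single call.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, including the two boundary values 2.7 and 4.4.
income = pd.Series([1.0, 2.7, 3.0, 4.4, 6.0])

# pd.cut with default right-closed bins: (0, 2.7], (2.7, 4.4], (4.4, inf]
via_cut = pd.cut(income, bins=[0, 2.7, 4.4, np.inf], labels=["L", "M", "H"])

# np.select applies the conditions in order and falls back to the default.
via_select = np.select(
    [(income <= 2.7).to_numpy(), (income >= 4.4).to_numpy()],
    ["L", "H"],
    default="M",
)
```

Only the value 4.4 is labeled differently, so the choice matters only if the data contain values exactly on a cut-off.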

Generating descriptive statistics

While .describe() summarizes every numeric column at once, you can also compute a specific summary for individual columns or subsets, as shown below.

In [ ]:

CHD.count()                            # number of non-missing values in each column
CHD[CHD.iloc[:, 1] < 34].nunique()     # distinct values per column among the filtered rows
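To make the behavior of these summaries concrete, the sketch below applies a few of them to a hypothetical four-row frame toy with made-up values and one missing entry. Note that count ignores missing values and that the methods must be called with parentheses.

```python
import pandas as pd

# Hypothetical data with one missing value in total_bedrooms.
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1467],
    "total_bedrooms": [129.0, 1106.0, None, 190.0],
})

non_missing = toy.count()               # non-missing values per column
distinct = toy.nunique()                # distinct values per column
mean_rooms = toy["total_rooms"].mean()  # summary of a single column
max_bedrooms = toy["total_bedrooms"].max()
```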

The following table includes the useful functions.

Function	Description

To split median_income into three categories based on percentiles (here the 30th and 70th), you can do the following:

In [ ]:

# Calculate the 30th and 70th percentile boundaries
q30 = np.percentile(CHD['median_income'], 30)
q70 = np.percentile(CHD['median_income'], 70)

CHD['famlev'] = pd.cut(CHD['median_income'], bins=[0, q30, q70, np.inf], labels=['L', 'M', 'H'])
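The percentile-based binning can be checked on a small example. The sketch below assumes a hypothetical Series income holding the made-up values 1 to 10, so the 30th and 70th percentiles fall between whole numbers and the group sizes are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes 1.0, 2.0, ..., 10.0.
income = pd.Series(range(1, 11), dtype=float)

q30 = np.percentile(income, 30)   # lies between 3 and 4
q70 = np.percentile(income, 70)   # lies between 7 and 8

levels = pd.cut(income, bins=[0, q30, q70, np.inf], labels=["L", "M", "H"])
labels = levels.tolist()
```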

Earlier we used .loc, which selects columns by label instead of by position. You can also chain square brackets: select a column first, then filter it with a boolean condition. The example below computes the mean median_house_value for rows where famlev is 'M'.

In [ ]:

CHD['median_house_value'][CHD['famlev'] == 'M'].mean()
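The same conditional mean can also be written with a single .loc call, which keeps the row condition and the column label in one indexing step. The sketch below compares the two forms on a hypothetical four-row frame toy with made-up values.

```python
import pandas as pd

# Hypothetical data; values are made up.
toy = pd.DataFrame({
    "median_house_value": [100.0, 200.0, 300.0, 400.0],
    "famlev": ["L", "M", "M", "H"],
})

is_middle = toy["famlev"] == "M"

chained = toy["median_house_value"][is_middle].mean()     # column, then filter
via_loc = toy.loc[is_middle, "median_house_value"].mean() # rows and column at once
```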

You can use the DataFrame .where() method to keep the values that satisfy a condition and replace all others, with NaN by default or with a value you supply as the second argument. The condition can be a Series, which applies to whole rows, or a DataFrame of booleans, which applies element by element.

In [ ]:

CHD_R = CHD[['total_rooms', 'total_bedrooms']]
CHD_R.where(CHD.total_rooms < 1000)       # rows failing the condition become NaN
CHD_R.where(CHD.total_rooms < 1000, 0)    # rows failing the condition become 0
con = CHD_R < 1000                        # element-wise boolean DataFrame
CHD_R.where(con, -999)                    # each failing element becomes -999
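The sketch below runs the same three kinds of .where() call on a hypothetical three-row frame toy with made-up values, so the effect of a row-wise condition versus an element-wise condition can be read off directly.

```python
import pandas as pd

# Hypothetical data; values are made up.
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 500],
    "total_bedrooms": [129, 1106, 1200],
})

# Series condition: a row is kept or blanked as a whole.
row_wise = toy.where(toy["total_rooms"] < 1000)

# DataFrame condition: each element is tested on its own.
element_wise = toy.where(toy < 1000)         # failing elements become NaN
filled = toy.where(toy < 1000, -999)         # failing elements become -999
```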

To select rows whose column value belongs to a set of allowed values, use the .isin() method, which returns a boolean Series. Passing that Series to np.where returns the integer positions of the matching rows, here those where famlev is 'M':

In [ ]:

import numpy as np
np.where(CHD.loc[:, 'famlev'].isin(['M']))    # tuple holding the positions of matching rows
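To separate the three steps involved (building the mask, locating positions, and filtering rows), the sketch below uses a hypothetical four-row frame toy with made-up labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data; labels are made up.
toy = pd.DataFrame({"famlev": ["L", "M", "H", "M"]})

mask = toy["famlev"].isin(["M"])              # boolean Series
positions = np.where(mask)[0]                 # integer positions of True values
rows = toy[mask]                              # the matching rows themselves
several = toy[toy["famlev"].isin(["M", "H"])] # membership in a larger set
```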

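This code returns the positions of the rows where the 'famlev' column has the value 'M'. To get the rows themselves, use the boolean Series directly, as in CHD[CHD['famlev'].isin(['M'])].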

You can use np.where to create a new column in a DataFrame based on specified conditions. Here's an example of how you can do that:

In [ ]:

CHD['size'] = np.where(CHD.total_rooms < 1000, 'small', 'big')

In this example, a new column 'size' is created where the values are determined based on the condition CHD.total_rooms < 1000. If the condition is true, it assigns 'small' to the 'size' column; otherwise, it assigns 'big'. You can adjust the condition and the values as needed for your specific use case.

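You can build the same kind of column with a list comprehension. Note that the example below uses a threshold of 100 rather than 1000, and that it overwrites the 'size' column created above.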

In [ ]:

CHD['size'] = ['small' if x < 100 else 'big' for x in CHD['total_rooms']]
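When the same threshold is used, np.where and a list comprehension give identical labels. The sketch below checks this on a hypothetical Series rooms with made-up values. np.where works on the whole column at once, while the list comprehension visits one value at a time.

```python
import numpy as np
import pandas as pd

# Hypothetical room counts; values are made up.
rooms = pd.Series([880, 7099, 1467, 500])

via_where = np.where(rooms < 1000, "small", "big").tolist()
via_listcomp = ["small" if x < 1000 else "big" for x in rooms]
```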

To remove rows and columns from a DataFrame, you can use the .drop method.

In [ ]:

CHD.drop([0, 5], axis=0)                      # returns a copy without the rows labeled 0 and 5
CHD.drop('longitude', axis=1, inplace=True)   # removes the column from CHD itself

Note that inplace=True applies the change directly to the original DataFrame and returns None. Without it, .drop returns a new DataFrame and leaves the original unchanged.
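The effect of inplace can be verified on a small example. The sketch below uses a hypothetical three-row frame toy with made-up coordinates and checks what each call returns and what happens to the original.

```python
import pandas as pd

# Hypothetical data; values are made up.
toy = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24],
    "latitude": [37.88, 37.86, 37.85],
})

without_rows = toy.drop([0, 2], axis=0)   # new frame; toy is untouched
rows_after_copy_drop = len(toy)

returned = toy.drop("longitude", axis=1, inplace=True)  # modifies toy, returns None
columns_after_inplace = toy.columns.tolist()
```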

To replace values in a DataFrame or in a single column, use the .replace() method. It returns a new object, so assign the result if you want to keep it.

In [ ]:

CHD['famlev'].replace('L','Low').replace('M','Middle').replace('H','High')
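Several replacements can also be passed in one call as a dictionary. The sketch below assumes a hypothetical frame toy with made-up labels and stores the result in a new column named famlev_long (a name chosen here for illustration), leaving the original column as it was.

```python
import pandas as pd

# Hypothetical data; labels are made up.
toy = pd.DataFrame({"famlev": ["L", "M", "H", "M"]})

mapping = {"L": "Low", "M": "Middle", "H": "High"}
toy["famlev_long"] = toy["famlev"].replace(mapping)  # assign to keep the result
```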

You can sort your data by a specific column using:

In [ ]:

CHD.sort_values(by='size')
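Like .drop, .sort_values returns a new DataFrame unless told otherwise. The sketch below sorts a hypothetical three-row frame toy with made-up values, once by a text column and once by a numeric column in descending order.

```python
import pandas as pd

# Hypothetical data; values are made up.
toy = pd.DataFrame({
    "size": ["small", "big", "small"],
    "total_rooms": [880, 7099, 500],
})

by_size = toy.sort_values(by="size")   # text sorts alphabetically: 'big' first
by_rooms_desc = toy.sort_values(by="total_rooms", ascending=False)
original_order = toy.index.tolist()    # toy itself is not reordered
```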

Don't forget to save the data after making changes or sorting it:

In [ ]:

CHD.to_csv("/Volumes/F/progwr/python/python_tech/analysis_data_using_python/data/CHD_test.csv",
           index=False, encoding='utf8')
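One way to confirm that a file was written as intended is to read it back. The sketch below writes a hypothetical two-row frame toy with made-up values to a temporary directory (so no real path is assumed) and reloads it with pd.read_csv.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; values are made up.
toy = pd.DataFrame({"total_rooms": [880, 7099], "size": ["small", "big"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "CHD_test.csv")
    toy.to_csv(path, index=False, encoding="utf8")  # index=False omits the row labels
    reloaded = pd.read_csv(path)
```

With index=False the file contains only the data columns, so the reloaded frame has the same columns as the one that was saved.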

Contents | Previous (3) Manipulating data frame | Next (4) Summarizing

Exercise 03-Manipulating data-frame
