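Selecting part of the data
To select a column, either put the column label in square brackets [] or use attribute access with a period after the DataFrame name. Attribute access only works when the label is a valid Python identifier and does not clash with an existing DataFrame attribute or method.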
CHD.median_house_value
CHD['median_house_value']
Selecting a single column from a pandas DataFrame returns a pandas Series, which is a one-dimensional data structure. You can access the data either by column label or by index through .iloc and .loc. .iloc selects by integer position and should be used with integer indices, while .loc is primarily label-based but also accepts a boolean array.
CHD.longitude
CHD['longitude']
CHD.iloc[:, 1]
CHD.iloc[:, [1, 3]]
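The selection styles above can be compared side by side. This is a minimal sketch, assuming a small made-up DataFrame named toy that stands in for CHD (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for CHD; the numbers are made up.
toy = pd.DataFrame({
    "longitude": [-122.2, -118.3, -121.9],
    "latitude": [37.9, 34.1, 37.3],
    "median_income": [8.3, 2.1, 4.0],
    "median_house_value": [452600, 150000, 240000],
})

col_series = toy["latitude"]        # single brackets return a Series
col_frame = toy[["latitude"]]       # a list of labels returns a DataFrame
by_position = toy.iloc[:, 1]        # second column, selected by integer position
by_label = toy.loc[:, "latitude"]   # the same column, selected by label
two_cols = toy.iloc[:, [1, 3]]      # columns at positions 1 and 3

print(type(col_series).__name__, type(col_frame).__name__)
print(by_position.equals(by_label))
```

Passing a list (of labels or positions) always yields a DataFrame, even when the list has a single entry.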
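Subsets of Rows and Columns
To select a subset of rows, use .iloc[row_index, :]; you can also filter rows with a boolean condition. Note that the slice 2:10 selects positions 2 through 9, whereas the list [2, 10] selects only the rows at positions 2 and 10.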
CHD.iloc[2:10]
CHD.iloc[2:10,:]
CHD.iloc[[2, 10], :]
CHD[CHD.iloc[:,1]<34]
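The difference between a slice, a list of positions, and a boolean filter is easier to see on a few rows. A minimal sketch, assuming a made-up DataFrame named toy in place of CHD:

```python
import pandas as pd

# Hypothetical stand-in for CHD; the numbers are made up.
toy = pd.DataFrame({
    "longitude": [-122.2, -118.3, -121.9, -117.1, -119.8],
    "latitude": [37.9, 34.1, 37.3, 32.7, 36.7],
})

sliced = toy.iloc[1:3]              # positions 1 and 2; the end of a slice is excluded
picked = toy.iloc[[0, 4], :]        # exactly the rows at positions 0 and 4
south = toy[toy["latitude"] < 34]   # rows where the boolean mask is True

print(sliced.index.tolist(), picked.index.tolist(), south.index.tolist())
```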
To select rows with a boolean Series and assign to a column at the same time, use .loc, which accepts a boolean mask for the rows together with a column label. For example, median_income can be split into three categories at the cut-off values 2.7 and 4.4:
CHD['famlev'] = ''
C1=CHD.median_income<=2.7
C2=CHD.median_income>=4.4
CHD.loc[C1,'famlev']='L'
CHD.loc[~C1&~C2,'famlev']='M'
CHD.loc[C2,'famlev']='H'
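This code can be rewritten with pd.cut. Because the bins are closed on the right by default, a value of exactly 4.4 falls in 'M' here, whereas the .loc version assigns it 'H':
import numpy as np
# Create a new column to store the categories
CHD['famlev2'] = pd.cut(CHD['median_income'], bins=[0, 2.7, 4.4, np.inf], labels=['L', 'M', 'H'])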
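To check how the two approaches treat values that sit exactly on a cut-off, the following sketch applies both to a made-up column. The DataFrame toy and the column names by_mask and by_cut are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up incomes that include both boundary values.
toy = pd.DataFrame({"median_income": [1.5, 2.7, 3.0, 4.4, 6.0]})

# Mask-based version, mirroring the .loc assignments above.
low = toy["median_income"] <= 2.7
high = toy["median_income"] >= 4.4
toy["by_mask"] = "M"
toy.loc[low, "by_mask"] = "L"
toy.loc[high, "by_mask"] = "H"

# pd.cut version; default bins are (0, 2.7], (2.7, 4.4], (4.4, inf].
toy["by_cut"] = pd.cut(
    toy["median_income"], bins=[0, 2.7, 4.4, np.inf], labels=["L", "M", "H"]
)

print(toy)
```

The two columns differ only for the row with income 4.4. Passing right=False to pd.cut would instead make the bins closed on the left, which moves 2.7 into 'M' and 4.4 into 'H'.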
Generating descriptive statistics
While .describe() can provide a summary of variables, you can extract a more specific summary of individual columns, as shown below.
CHD.count()
CHD[CHD.iloc[:, 1] < 34].nunique()
The following table lists useful summary functions.
| Function | Description |
|---|---|
| count | Number of non-null observations |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation (removed in pandas 2.0) |
| median | Median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute value |
| prod | Product of values |
| std | Unbiased standard deviation |
| var | Unbiased variance |
| sem | Unbiased standard error of the mean |
| skew | Unbiased skewness (3rd moment) |
| kurt | Unbiased kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |
| nunique | Number of unique elements |
| value_counts | Counts of unique values |
| cov | Covariance between columns |
| corr | Correlation between columns |
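Several of the functions in the table can be tried on a tiny example. A minimal sketch, assuming a made-up DataFrame named toy with one missing value:

```python
import pandas as pd

# Hypothetical data; None becomes NaN in the float column.
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, 1106.0, None, 235.0],
})

non_null = toy.count()      # a method call, so the parentheses are required
distinct = toy.nunique()    # NaN is not counted as a distinct value by default
summary = toy["total_rooms"].agg(["mean", "median", "std"])

print(non_null, distinct, summary, sep="\n")
```

Using .agg with a list of function names is one way to compute several of the table's statistics in a single call.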
To split median_income into three categories based on percentiles (here the 30th and 70th), you can do the following:
# Calculate the 30th and 70th percentile boundaries
q30 = np.percentile(CHD['median_income'], 30)
q70 = np.percentile(CHD['median_income'], 70)
CHD['famlev'] = pd.cut(CHD['median_income'], bins=[0, q30, q70, np.inf], labels=['L', 'M', 'H'])
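The percentile-based split can be verified on data where the boundaries are easy to compute by hand. A minimal sketch, assuming a made-up DataFrame named toy whose incomes are simply 1 through 10:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes 1.0, 2.0, ..., 10.0.
toy = pd.DataFrame({"median_income": np.arange(1.0, 11.0)})

# np.percentile interpolates linearly between neighbouring values by default.
q30 = np.percentile(toy["median_income"], 30)
q70 = np.percentile(toy["median_income"], 70)

toy["famlev"] = pd.cut(
    toy["median_income"], bins=[0, q30, q70, np.inf], labels=["L", "M", "H"]
)

print(q30, q70)
print(toy["famlev"].value_counts())
```

The lower bin edge of 0 assumes all incomes are positive; a value of 0 or below would fall outside every bin and become NaN.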
The earlier three-level example used .loc with a boolean mask and a column label rather than positional indices. You can also combine a column selection with a boolean condition by chaining square brackets, for example to compute the mean house value of the 'M' group:
CHD['median_house_value'][CHD['famlev'] == 'M'].mean()
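The chained selection above has a few equivalent spellings. The sketch below, on a made-up DataFrame named toy, compares chaining, a single .loc call, and groupby (which computes the mean for every level at once):

```python
import pandas as pd

# Hypothetical data; the values are made up.
toy = pd.DataFrame({
    "famlev": ["L", "M", "M", "H"],
    "median_house_value": [100000, 200000, 300000, 500000],
})

chained = toy["median_house_value"][toy["famlev"] == "M"].mean()
with_loc = toy.loc[toy["famlev"] == "M", "median_house_value"].mean()
by_group = toy.groupby("famlev")["median_house_value"].mean()

print(chained, with_loc)
print(by_group)
```

For reading values all three give the same answer. For assignment, the single .loc call is the safer form because chained indexing may operate on a temporary copy.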
The DataFrame .where method, similar in spirit to np.where for NumPy arrays, keeps the values where a condition is true and replaces the others, with NaN by default or with a value you supply as the second argument. The condition can be a boolean Series (applied row by row) or a boolean DataFrame (applied element by element).
CHD_R = CHD[['total_rooms', 'total_bedrooms']]
CHD_R.where(CHD.total_rooms < 1000)
CHD_R.where(CHD.total_rooms < 1000, 0)
con = CHD_R < 1000
CHD_R.where(con, -999)
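The effect of a row-level versus an element-level condition in .where can be seen on three rows. A minimal sketch, assuming a made-up DataFrame named toy in place of CHD_R:

```python
import pandas as pd

# Hypothetical stand-in for CHD_R; the values are made up.
toy = pd.DataFrame({
    "total_rooms": [500, 1500, 900],
    "total_bedrooms": [100, 1200, 1000],
})

# Series condition: whole rows are kept or blanked out with NaN.
rows_kept = toy.where(toy["total_rooms"] < 1000)

# DataFrame condition: each cell is tested on its own and failures become -999.
cells_kept = toy.where(toy < 1000, -999)

print(rows_kept)
print(cells_kept)
```

Unlike boolean filtering with toy[mask], .where preserves the original shape, which is useful when the result has to stay aligned with other columns.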
To select rows whose values belong to a given set, such as famlev equal to 'M', use the .isin() method. It returns a boolean mask with one entry per row, so it is typically used to filter rows rather than individual elements. Combined with np.where, the mask can be turned into row positions:
import numpy as np
np.where(CHD.loc[:,'famlev'].isin(['M']))
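This returns the integer positions of the rows where the 'famlev' column has the value 'M' (as a tuple containing one array). To get the rows themselves, use the mask directly, as in CHD[CHD['famlev'].isin(['M'])].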
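The distinction between row positions and the rows themselves can be sketched as follows, assuming a made-up DataFrame named toy:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the values are made up.
toy = pd.DataFrame({
    "famlev": ["L", "M", "H", "M"],
    "total_rooms": [880, 7099, 1467, 1274],
})

mask = toy["famlev"].isin(["M"])               # boolean Series, one entry per row
positions = np.where(mask)[0]                  # integer positions where the mask is True
rows = toy[mask]                               # the matching rows
low_or_high = toy[toy["famlev"].isin(["L", "H"])]   # isin accepts several values

print(positions)
print(rows)
```

For a single value, toy["famlev"] == "M" gives the same mask; .isin becomes more convenient once several values are allowed.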
You can use np.where to create a new column in a DataFrame based on specified conditions. Here's an example of how you can do that:
CHD['size'] = np.where(CHD.total_rooms < 1000, 'small', 'big')
Here the new column 'size' is 'small' where the condition CHD.total_rooms < 1000 is true and 'big' otherwise.
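You can also perform simple element-wise operations on a DataFrame column with a list comprehension. Note that the example below uses a threshold of 100 rather than 1000 and overwrites the 'size' column.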
CHD['size']=['small' if x<100 else 'big' for x in CHD['total_rooms']]
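With the same threshold, np.where and a list comprehension produce identical labels. A minimal sketch on a made-up DataFrame named toy; the column names size_np and size_lc are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data that includes the boundary value 1000.
toy = pd.DataFrame({"total_rooms": [880, 7099, 999, 1000]})

# Vectorised version: the condition is evaluated on the whole column at once.
toy["size_np"] = np.where(toy["total_rooms"] < 1000, "small", "big")

# List comprehension: the condition is evaluated one value at a time.
toy["size_lc"] = ["small" if x < 1000 else "big" for x in toy["total_rooms"]]

print(toy)
```

The vectorised form is generally the better default on large columns, while the comprehension can be easier to read when the rule is more involved.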
To remove rows and columns from a DataFrame, you can use the .drop method.
CHD.drop([0,5], axis=0)
CHD.drop('longitude',axis=1, inplace=True)
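Note that inplace=True modifies the original DataFrame directly. Without it, .drop returns a new DataFrame and leaves the original unchanged, so the result of the first call above must be assigned to a variable to be kept.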
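The difference between the returned copy and the in-place change can be sketched on a made-up DataFrame named toy:

```python
import pandas as pd

# Hypothetical data; the values are made up.
toy = pd.DataFrame({
    "longitude": [-122.2, -118.3, -121.9, -117.1],
    "latitude": [37.9, 34.1, 37.3, 32.7],
    "total_rooms": [880, 7099, 1467, 1274],
})

fewer_rows = toy.drop([0, 2], axis=0)       # new DataFrame without index labels 0 and 2
fewer_cols = toy.drop(columns="longitude")  # columns= is an alternative to axis=1
rows_before = len(toy)                      # toy itself still has all four rows

toy.drop("latitude", axis=1, inplace=True)  # modifies toy and returns None

print(fewer_rows.index.tolist(), list(fewer_cols.columns), list(toy.columns))
```

Row labels passed to .drop are index labels, not positions, which matters once the index is no longer 0, 1, 2, and so on.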
To replace values in a DataFrame or Series, use the .replace() method. It returns a new object, so assign the result if you want to keep it.
CHD['famlev'].replace('L','Low').replace('M','Middle').replace('H','High')
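The chained calls above can also be written as a single call with a dictionary that maps old values to new ones. A minimal sketch on a made-up DataFrame named toy; the column name famlev_long is hypothetical:

```python
import pandas as pd

# Hypothetical data with plain string labels.
toy = pd.DataFrame({"famlev": ["L", "M", "H", "M"]})

# One mapping instead of three chained calls; the result is assigned to keep it.
labels = {"L": "Low", "M": "Middle", "H": "High"}
toy["famlev_long"] = toy["famlev"].replace(labels)

print(toy)
```

Values that do not appear in the mapping are left as they are.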
You can sort your data by a specific column using:
CHD.sort_values(by='size')
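Sorting can use more than one column and a different direction for each. A minimal sketch on a made-up DataFrame named toy:

```python
import pandas as pd

# Hypothetical data; the values are made up.
toy = pd.DataFrame({
    "size": ["small", "big", "small"],
    "total_rooms": [880, 7099, 500],
})

# Sort by size (ascending), then by total_rooms (descending) within each size.
ordered = toy.sort_values(by=["size", "total_rooms"], ascending=[True, False])

print(ordered)
```

Like .drop, sort_values returns a new DataFrame, so toy keeps its original row order unless the result is assigned back.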
Don't forget to save the data after making changes or sorting it:
CHD.to_csv("/Volumes/F/progwr/python/python_tech/analysis_data_using_python/data/CHD_test.csv",
           index=False, encoding='utf8')
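To confirm that a saved file can be read back, the sketch below writes a made-up DataFrame named toy to a temporary directory and reloads it. The file name toy.csv is hypothetical and the temporary location stands in for the path used above:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; the values are made up.
toy = pd.DataFrame({"famlev": ["L", "M"], "total_rooms": [880, 7099]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False, encoding="utf-8")  # index=False omits the row index
    reloaded = pd.read_csv(path)

print(reloaded)
```

With index=False the file holds only the data columns, so reading it back does not introduce an extra unnamed index column.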