Matplotlib¶
Python offers sophisticated plotting capabilities through the matplotlib library. Plotting with matplotlib can be approached in two ways:
- Object-oriented approach: In this method, an empty figure object (with its axes) is created first, and the plots are then constructed programmatically by calling methods on those objects.
- pyplot approach: This approach relies on the matplotlib.pyplot module, which provides user-friendly functions for interactive plotting that act on the current figure. A common shorthand for importing this module is import matplotlib.pyplot as plt.
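To make the two styles concrete, the sketch below draws the same scatter plot twice: first by creating empty figure and axes objects and filling them through their methods, then through the plt functions. The arrays x_demo and y_demo are synthetic and made up for illustration; they are not part of the CHD data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x_demo = rng.normal(size=50)
y_demo = 2 * x_demo + rng.normal(size=50)

# Style 1: create the figure and axes objects first, then fill them via methods
fig1, ax1 = plt.subplots()
ax1.scatter(x_demo, y_demo)
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.set_title("Built through the Axes object")

# Style 2: call plt functions, which act on the current figure and axes
fig2 = plt.figure()
plt.scatter(x_demo, y_demo)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Built through plt functions")
ax2 = plt.gca()  # the axes that the plt calls above were drawing on
```

Both styles produce equivalent figures. The first tends to be easier to manage once a figure holds several panels, because each call names the axes it modifies.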
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
CHD=pd.read_csv('../data/CHD_test.csv',index_col=False)
CHD.head()
| | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | famlev |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | L |
| 1 | 34.40 | 19 | 7650 | 1901 | 1129 | 463 | 1.8200 | 80100 | L |
| 2 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | L |
| 3 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | M |
| 4 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.9250 | 65500 | L |
Scatter plot¶
The most commonly used plot is the scatter plot. The following script plots median income against median house value from the CHD data:
x = CHD.median_house_value
y = CHD.median_income
plt.scatter(x, y)
plt.xlabel('median_house_value')
plt.ylabel('median_income')
plt.title('Scatter Plot')
plt.show(block=False)
You can customize the appearance of a scatter plot using various arguments:
- Size of points (s): you can adjust the size of the points; for example, s=10 makes the points smaller.
- Colour (c): you can specify the colour of the points; for example, c='red' makes the points red.
- Marker (marker): you can choose different marker styles for the points; for example, marker='s' uses squares to present the points.
plt.scatter(x, y,s=40, c='red', marker='s')
plt.xlabel('median_house_value')
plt.ylabel('median_income')
plt.title('Scatter Plot')
plt.show(block=False)
In this example, the data points are drawn as red squares with a larger size (s=40). The matplotlib documentation lists further marker styles and line properties. The following plot is more sophisticated: all points are drawn semi-transparently (alpha=0.3), and the points with famlev equal to 'L' are overlaid as hollow red circles.
select = (CHD.famlev == 'L')
plt.scatter(x, y, alpha=0.3)
plt.xlabel('median_house_value')
plt.ylabel('median_income')
plt.title('Scatter Plot')
plt.scatter(x[select], y[select], facecolor='none', edgecolors='r')
plt.show(block=False)
The following fits a linear model to the data with np.polyfit and overlays the fitted line on the scatter plot.
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.5, color='orchid')
fig.suptitle('Scatter Plot')
fig.tight_layout(pad=2);
ax.grid(True)
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0]*x + fit[1], '-',color='red', linewidth=2)
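As a sanity check on how np.polyfit is used above, the sketch below fits a line to synthetic data generated from a known slope and intercept (the values 2 and 1 are assumptions chosen for illustration). np.polyfit returns the coefficients from the highest degree down, so with deg=1 the first entry is the slope and the second is the intercept.

```python
import numpy as np

# Synthetic data from a known line, y = 2x + 1, plus Gaussian noise (illustrative values)
rng = np.random.default_rng(42)
x_demo = np.linspace(0, 10, 200)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(scale=0.5, size=x_demo.size)

# Coefficients come back highest degree first: [slope, intercept] for deg=1
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)

# np.polyval evaluates the fitted polynomial, equivalent to slope * x_demo + intercept
y_fit = np.polyval([slope, intercept], x_demo)
```

The recovered slope and intercept should land close to the values used to generate the data, which is a quick way to confirm the coefficient order before plotting fit[0]*x + fit[1].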
subplot¶
You can create multiple subplots in a single figure using plt.subplot(nrows, ncols, index), where index counts the panels from 1, row by row.
plt.subplot(2, 1, 1)
plt.scatter(x, y)
plt.title("Fig1")
plt.xlabel("median house value")
plt.ylabel("median income")
plt.subplot(2, 1, 2)
plt.scatter(x, CHD.population)
plt.title("Fig2")
plt.xlabel("median house value")
plt.ylabel("population")
plt.show(block=False)
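An alternative sketch of the same two-panel layout uses plt.subplots, which returns the figure together with an array of axes. The data here is synthetic and the variable names are made up for illustration. Calling fig.tight_layout() helps keep the titles and axis labels of stacked panels from overlapping.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the CHD columns, for illustration only
rng = np.random.default_rng(1)
value = rng.uniform(50_000, 500_000, size=100)
income = value / 100_000 + rng.normal(size=100)
population = rng.integers(100, 5000, size=100)

# Two rows, one column: axes is a 1-D array holding the two panels
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(6, 8))

axes[0].scatter(value, income)
axes[0].set_title("Fig1")
axes[0].set_xlabel("median house value")
axes[0].set_ylabel("median income")

axes[1].scatter(value, population)
axes[1].set_title("Fig2")
axes[1].set_xlabel("median house value")
axes[1].set_ylabel("population")

fig.tight_layout()  # adjust spacing so labels and titles do not collide
```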
Seaborn¶
Seaborn provides advanced graphical capabilities for creating sophisticated statistical visualizations with ease. It simplifies the process of generating complex plots from pandas DataFrames using simple commands.
import seaborn as sns
sns.set(color_codes=True)
CHD['median_income'] = (CHD['median_income'] -CHD['median_income'].mean()) / CHD['median_income'].std()
CHD['median_house_value'] = (CHD['median_house_value'] -CHD['median_house_value'].mean()) / CHD['median_house_value'].std()
for col in ['median_income','median_house_value']:
    plt.hist(CHD[col], density=True)
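The two assignments above standardize each column to z-scores (subtract the mean, then divide by the standard deviation) so that the histograms share a common scale. A small sketch on a made-up column shows the effect. Note that the std() method in pandas uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up column, for illustration only
demo = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0, 5.0]})

# z-score: subtract the mean, divide by the (sample) standard deviation
demo["income_z"] = (demo["income"] - demo["income"].mean()) / demo["income"].std()

z_mean = demo["income_z"].mean()  # approximately 0 after standardization
z_std = demo["income_z"].std()    # approximately 1 after standardization
```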
We can get a smooth estimate of the distribution using a kernel density estimation (KDE):
import warnings
warnings.filterwarnings("ignore")
sns.kdeplot(data=CHD, x='median_income', y='median_house_value')
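To give an idea of what a kernel density estimate computes, here is a minimal one-dimensional sketch in plain NumPy: the estimate is the average of Gaussian bumps centred on each sample. This is an illustration under simplifying assumptions, not seaborn's implementation; the helper gaussian_kde_1d and the bandwidth of 0.4 are made up for this example, whereas kdeplot chooses a bandwidth automatically and, in the call above, works in two dimensions.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Synthetic samples, for illustration only
rng = np.random.default_rng(0)
samples = rng.normal(size=300)

grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# A density should integrate to about 1; check with a simple Riemann sum
area = float(np.sum(density) * (grid[1] - grid[0]))
```

A smaller bandwidth follows the samples more closely and gives a bumpier curve; a larger one gives a smoother but flatter estimate.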
You can create a hexagonally binned histogram using jointplot with kind="hex":
sns.jointplot(data=CHD, x='median_income', y='median_house_value',kind="hex")
sns.jointplot(data=CHD, x='median_income', y='median_house_value',kind="kde", hue='famlev')
The following illustrates how to draw a box plot for different family levels.
g=sns.catplot(data=CHD, x='median_income', y='famlev', kind="box")
g.set_axis_labels("Income", "Family level");
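The box in a box plot spans the first to third quartile of each group, with a line at the median. As a numeric companion to the plot, the sketch below computes those three quartiles per group with pandas on a made-up table (the values and the column name income are assumptions for illustration).

```python
import pandas as pd

# Made-up data with two family levels, for illustration only
demo = pd.DataFrame({
    "famlev": ["L"] * 5 + ["M"] * 5,
    "income": [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
})

# One row per family level, one column per quartile (0.25, 0.5, 0.75)
quartiles = (
    demo.groupby("famlev")["income"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
```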
Pairplots¶
We can generalize joint plots for multidimensional data, which is very useful for exploring correlations between multiple dimensions of data.
sns.pairplot(CHD, hue='famlev');
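A pair plot is often read alongside the correlation matrix of the same columns. The sketch below, on synthetic data with made-up column names a, b and c, computes the correlations with DataFrame.corr() and draws a basic grid of pairwise scatter plots with pandas.plotting.scatter_matrix, a simpler matplotlib-based alternative to sns.pairplot (without the hue grouping).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data, for illustration only
rng = np.random.default_rng(3)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),  # strongly correlated with a
    "c": rng.normal(size=200),            # generated independently of a
})

# Pairwise Pearson correlations between the columns
corr = demo.corr()

# Grid of pairwise scatter plots, with histograms on the diagonal
axes = scatter_matrix(demo, diagonal="hist")
```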
GGPLOT2¶
ggplot2 is a very useful package in R for creating advanced plots. In Python, the plotnine library is used to create ggplot2-like plots. You can import the module using import plotnine as p9. Generating a plot with plotnine follows a structured series of steps:
- Initialize the plot with the data:
import plotnine as p9
p9.ggplot(data=CHD)
- Define aesthetics using aes and specify your arguments. The most important aesthetics include: x, y, alpha, color (or colour), fill, linetype, shape, size, and stroke. To create variations of the plot with different parameters, you can assign it to a variable.
CHD_plot=p9.ggplot(data=CHD,mapping=p9.aes(x='median_income', y='median_house_value'))
- Specify what you want to display and use the + operator to add layers and customize your plot.
CHD_plot+p9.geom_point()
You can easily add scales and define labels:
(CHD_plot
+ p9.geom_point(alpha=0.15)
+ p9.xlab("median_income")
+ p9.ylab("median_house_value")
+ p9.scale_x_log10()
+ p9.theme_bw()
+ p9.theme(text=p9.element_text(size=10))
)
- After creating your plot, you can save it to a file in your favourite format using the save method:
CHD_plot2 = CHD_plot+p9.geom_point()
CHD_plot2.save("CHD_plot.png", dpi=300)
Bar chart¶
To generate a bar chart, you can use geom_bar(), which by default counts the number of rows in each category of x:
(p9.ggplot(data=CHD,mapping=p9.aes(x='famlev'))+ p9.geom_bar())
Plotting distributions¶
- A boxplot can be created using
geom_boxplot():
(p9.ggplot(data=CHD,
mapping=p9.aes(x='famlev',
y='median_income'))
+ p9.geom_boxplot()
+ p9.scale_y_log10()
)
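- To show the individual data points together with the boxplot, you can use geom_jitter(), which plots the points with some random noise to avoid overlapping points. Layers are drawn in the order they are added, so in the example below the points appear on top of the boxes; add geom_jitter() before geom_boxplot() to place them behind. Here's an example: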
(p9.ggplot(data=CHD,
mapping=p9.aes(x='famlev',
y='median_income'))
+ p9.geom_boxplot()
+ p9.geom_jitter(alpha=0.1, color="green")
+ p9.scale_y_log10()
)