14 October 2019

Data

To show how fit the multiple regression using R and Python, we consider the car data [car] which has the car specifications; HwyMPG, Model, etc,. We fit different regreesion models to predict Highway MPG using the car specifications.

Contents

Import data

Using R

car_data=read.csv2("https://raw.githubusercontent.com/saeidamiri1/saeidamiri1.github.io/master/public/data/cardata.csv")
names(car_data)
head(car_data)

Using Python

import pandas as pd
car_data = pd.read_csv("https://raw.githubusercontent.com/saeidamiri1/saeidamiri1.github.io/master/public/data/cardata.csv", delimiter=";", decimal=",")
car_data.columns
car_data.head(10)

Preprocessing data

Select the numerical variables to fit the simple regression model. The data have the missing values, so we run an imputuion procedure to fill the missiong values.

Using R

# Select numerical variables
v0=as.numeric(which(unlist(lapply(car_data, is.numeric))))
# Check there is any missing values 
car_data=car_data[,v0]
anyNA(car_data)
apply(is.na(car_data),2,sum)
# Run Imputation procedure
library(mice)
m_car_data<-mice(car_data,print=FALSE)
m_car_data<-complete(m_car_data)
anyNA(m_car_data$imp)

Using Python

# Select numerical variables
car_data=car_data.select_dtypes(exclude=['object'])
# Check there is any missing values 
car_data.isnull().any().any()
car_data.apply(lambda x: x.isnull().any().any(), axis=0)
# Run Imputation procedure
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(missing_values=np.NaN)
imp.fit(car_data)
m_car_data=pd.DataFrame(imp.transform(car_data))
m_car_data.columns=car_data.columns

Select variables

To select variable, calculate the correlation

Using R

cor(m_car_data,m_car_data$HwyMPG,)
cor(m_car_data$HwyMPG,m_car_data$GasTank)
cor(m_car_data$HwyMPG,m_car_data$Rev)

plot(car_data[,c('HwyMPG','GasTank','Rev')])

Using Python

m_car_data.corr().HwyMPG
m_car_data.loc[:,['HwyMPG','Length']].corr()
m_car_data.loc[:,['HwyMPG','Width']].corr()

import seaborn as sns
import matplotlib.pyplot as plt
aa=pd.plotting.scatter_matrix(m_car_data.loc[:,['HwyMPG','GasTank','Rev']])
plt.show()

Fit model

Using R

model_r=lm(HwyMPG ~ Rev, data=m_car_data)
summary(model_r)$r.squared

model_g=lm(HwyMPG ~ GasTank, data=m_car_data)
summary(model_g)$r.squared

model_rg=lm(HwyMPG ~ Rev+GasTank, data=m_car_data)
summary(model_rg)$r.squared

Using Python

from sklearn.linear_model import LinearRegression
lr = LinearRegression(normalize=True)
model_r=lr.fit(m_car_data.loc[:,['Rev']], m_car_data.HwyMPG)
model_r_pred = model_r.predict(m_car_data.loc[:,['Rev']])
from sklearn.metrics import r2_score
r2_score(m_car_data.HwyMPG, model_r_pred)

model_g=lr.fit(m_car_data.loc[:,['GasTank']], m_car_data.HwyMPG)
model_g_pred= model_g.predict(m_car_data.loc[:,['GasTank']])
r2_score(m_car_data.HwyMPG,model_g_pred)

model_rg=lr.fit(m_car_data.loc[:,['Rev','GasTank']], m_car_data.HwyMPG)
model_rg_pred = model_rg.predict(m_car_data.loc[:,['Rev','GasTank']])
r2_score(m_car_data.HwyMPG, model_rg_pred )

Run of codes

⬆ back to top

References

[car] Consumer Reports: The 1993 Cars - Annual Auto Issue (April 1993), Yonkers, NY: Consumers Union.

License

Copyright (c) 2019 Saeid Amiri