Logistic Regression
The logistics regression is used when the response variable is a binary variable, such as (0 or 1), (live or die), and (fail or succeed).
Contents
- Mathematical presentation
- data
- Necessary packages
- Import data
- Check data
- Select variables
- Fit model
- Fit directrly
- Fit using Cross-validation
- Model evaluation
Mathematical presentation
The model consideres the probability of event instead the event in model:
which can be converted to
or
Unlike the regular regression, we can not apply mean square error to find the estimates of coefficients, they can be found using Maximum likelihood:
=\prod_{i=1}^{n}\pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}=\prod_{i=1}^{n}\left(\frac{\pi_{i}}{1-\pi_{i}}\right)^{y_{i}}(1-\pi_{i})) |
where
is the odds for
. The log-likelihood can be obtained as
Data
To show how fit the logistics regression using Python, we consider the [frogs] data, which is about the distribution of the Southern Corroboree frog, which occurs in the Snowy Mountains area of New South Wales, Australia.
Necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
Import data
frog_data = pd.read_csv("/Volumes/F/for_my_website/saeidamiri1.github.io0/saeidamiri1.github.io/public/general_data/logistic_frog/frogs.csv", delimiter=";", decimal=",")
frog_data.columns
frog_data.head(10)
Check data
Check whether the data has missing value or not?
# Check there is any missing values
frog_data.isnull().any().any()
Select variables
Select the variables and depict the scatter plot to see the relation between them
colors = ["" for x in range(frog_data.shape[0])]
colors=pd.array(colors)
colors[frog_data['pres.abs']==1]=['red']
colors[frog_data['pres.abs']==0]=['blue']
df=frog_data[['meanmin','pres.abs']]
ax=sns.regplot(x='meanmin', y="pres.abs", data=df, fit_reg=True, x_jitter=0.1,y_jitter=0.1, scatter_kws={'alpha': 0.5,'s':20})
ax.legend().set(ylabel='Present of frogs', xlabel='mean minimum Spring temperature')
plt.show()
Fit model
Fit directrly
Logistic regression can be fit directly
x_train, x_test, y_train, y_test = train_test_split(df['meanmin'], df['pres.abs'], test_size=0.3)
x_train=x_train[:,np.newaxis]
x_test=x_test[:,np.newaxis]
logistic_model = LogisticRegression(random_state=0)
logistic_model.fit(x_train, y_train)
y_train_pred=logistic_model.predict(x_train)
y_test_pred=logistic_model.predict(x_test)
logistic_model.coef_
Fit using Cross-validation
We can run cross-validation to find the best estimate
cross_val_model = LogisticRegression(random_state=0)
scores = cross_val_score(cross_val_model, x_train, y_train, cv=10)
print(scores)
print(np.mean(scores))
penalty = ['l2']
C = np.logspace(0, 4, 10)
random_state=[0]
# creating a dictionary of hyperparameters
hyperparameters = dict(C=C, penalty=penalty, random_state=random_state)
Now define the Stochastic Gradient Descent
clf = GridSearchCV(estimator = logistic_model, param_grid = hyperparameters, cv=5)
best_model = clf.fit(x_train, y_train)
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
best_predicted_values = best_model.predict(x_test)
print(best_predicted_values)
accuracy_score(best_predicted_values, y_test.values)
Model evaluation
cfm_train = confusion_matrix(y_train,y_train_pred)
cfm_test = confusion_matrix(y_test,y_test_pred)
print(cfm_train)
print(cfm_test)
print(classification_report(y_train,y_train_pred))
print(classification_report(y_test,y_test_pred))
References
[frogs] Hunter, D. (2000) The conservation and demography of the southern corroboree frog (Pseudophryne corroboree). M.Sc. thesis, University of Canberra, Canberra. [ref] https://github.com/animesh-agarwal/Machine-Learning-Datasets/blob/master/census-data/census%20income%20logistic%20regression.ipynb
License
Copyright (c) 2020 Saeid Amiri