Logistic Regression using R: Introduction
Hi MLEnthusiasts! Today, we will learn how to implement logistic regression using R that too on a well-known dataset, The Titanic Dataset! So, our analysis becomes by getting some information about the dataset, like what all variables are in our dataset and what do we have to predict.
What to expect at the end?
The model gives 80.02% accuracy on train data-set and 78.83% accuracy on test data-set.
The data set
The dataset can be found on this link of kaggle. Following are the variables of this dataset:
- survival: Tells whether a particular passenger survived or not. 0 for not survived, 1 for survived.
- pClass: Ticket class, 1 for 1st class, 2 for 2nd class and 3 for 3rd class.
- sex: Tells us the gender of the passenger
- Age: in years
- sibsp: # of siblings or spouses aboarding the titanic
- parch: # of parents/children of the passenger
- fare: passenger fare
- embarked: The port of embarkment; C for Cherbourg, Q for Queenstown and S for Southampton
Logistic Regression using R: Problem Statement
Having seen what the data is all about, let’s also understand the problem statement. The problem is to make a logistic regression model using all the variables mentioned above with dependent variable being Survived and other variables being independent. Thus, it will be a predictive model which predicts whether a passenger having given parameters will survive or not. By looking closely at the problem, we can say that it’s a binary classification problem(0/1) or logistic regression problem.
Reading, viewing and summarizing the data
Let us first set our working directory and import our dataset.
setwd("C:/Users/jyoti/Downloads/LogisticRegression")
data <- read.csv("titanic.csv")
Here, data is a dataframe having all the variables and data of those variables. The dataframe has 1257 observations of 9 variables. The next step is to view the data inside the dataframe.
View(data)
Now starts the first main step, “Data Preparation”. To see if there is any missing data or to know about the mean or standard deviation, we use the summary() function.
summary(data)
## pclass survived name
## Min. :1.00 Min. :0.0000 Connolly, Miss. Kate : 2
## 1st Qu.:2.00 1st Qu.:0.0000 Kelly, Mr. James : 2
## Median :3.00 Median :0.0000 Abbing, Mr. Anthony : 1
## Mean :2.31 Mean :0.3827 Abbott, Master. Eugene Joseph : 1
## 3rd Qu.:3.00 3rd Qu.:1.0000 Abbott, Mr. Rossmore Edward : 1
## Max. :3.00 Max. :1.0000 Abbott, Mrs. Stanton (Rosa Hunt): 1
## (Other) :1249
## sex age sibsp parch
## female:452 Min. : 1.00 Min. :0.000 Min. :0.0000
## male :805 1st Qu.:21.00 1st Qu.:0.000 1st Qu.:0.0000
## Median :28.00 Median :0.000 Median :0.0000
## Mean :29.07 Mean :0.502 Mean :0.3779
## 3rd Qu.:37.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :60.00 Max. :8.000 Max. :9.0000
## NA's :261
## fare embarked
## Min. : 0.000 C:256
## 1st Qu.: 7.896 Q:119
## Median : 14.400 S:882
## Mean : 32.721
## 3rd Qu.: 31.000
## Max. :512.329
##
Using histograms for missing value imputation
As can be seen, there are 261 missing values in the Age variable. We need to do missing value imputation in this case. But, before doing that, we need to check how the age distribution looks like so that we can know which imputation method to choose and apply.
hist(data$age)
Since the distribution looks somewhat normal, we can use mean value imputation in this case. That is, we can replace the missing values with the mean of the age. This doesn’t deviate the mean and the distribution of the age remains the same.
data$age[is.na(data$age)] = 29.07
summary(data)
## pclass survived name
## Min. :1.00 Min. :0.0000 Connolly, Miss. Kate : 2
## 1st Qu.:2.00 1st Qu.:0.0000 Kelly, Mr. James : 2
## Median :3.00 Median :0.0000 Abbing, Mr. Anthony : 1
## Mean :2.31 Mean :0.3827 Abbott, Master. Eugene Joseph : 1
## 3rd Qu.:3.00 3rd Qu.:1.0000 Abbott, Mr. Rossmore Edward : 1
## Max. :3.00 Max. :1.0000 Abbott, Mrs. Stanton (Rosa Hunt): 1
## (Other) :1249
## sex age sibsp parch
## female:452 Min. : 1.00 Min. :0.000 Min. :0.0000
## male :805 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000
## Median :29.07 Median :0.000 Median :0.0000
## Mean :29.07 Mean :0.502 Mean :0.3779
## 3rd Qu.:34.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :60.00 Max. :8.000 Max. :9.0000
##
## fare embarked
## Min. : 0.000 C:256
## 1st Qu.: 7.896 Q:119
## Median : 14.400 S:882
## Mean : 32.721
## 3rd Qu.: 31.000
## Max. :512.329
##
As can be seen above, age doesn’t have any missing value now. Let’s see how the data looks like now.
head(data)
Univariate analysis
Now, let us understand the concept of dummy variables. Suppose a variable “A” has n classes. This variable A can be replaced by n-1 variables. If A has i, j, k, …, classes, then A_i = 1 in the rows at which i appears in A’s column and 0 for the rest of the rows. Same applies for j, k.. etc. The last value gets taken care of by the intercept. So, let’s introduce dummy variables inside our data for sex and embarked columns since they are holding the categorical data.
data$female = ifelse(data$sex=="female", 1, 0)
data$embarked_c = ifelse(data$embarked=="C", 1, 0)
data$embarked_s = ifelse(data$embarked=="S", 1, 0)
head(data)
Now, if you have a look at dataframe, it contains 12 variables and not 9. The next step is to remove those variables which we no longer need, name, sex since it is already taken into account by female variable, embarked, i.e. column number 3, 4 and 9.
data = data[-c(3, 4, 9)]
head(data)
Outlier manipulation
Let’s do univariate analysis of the numerical variables, age and fare now.
bx = boxplot(data$age)
Thus, there are outliers in the age variable and we need to do outlier handling in this case.
bx$stats
## [,1]
## [1,] 4.00
## [2,] 22.00
## [3,] 29.07
## [4,] 34.00
## [5,] 52.00
quantile(data$age, seq(0, 1, 0.02))
## 0% 2% 4% 6% 8% 10% 12% 14% 16%
## 1.0000 3.0000 7.0000 11.0000 15.0000 17.0000 18.0000 18.0000 19.0000
## 18% 20% 22% 24% 26% 28% 30% 32% 34%
## 20.0000 21.0000 21.3200 22.0000 23.0000 24.0000 24.0000 25.0000 25.0000
## 36% 38% 40% 42% 44% 46% 48% 50% 52%
## 26.0000 27.0000 28.0000 29.0000 29.0448 29.0700 29.0700 29.0700 29.0700
## 54% 56% 58% 60% 62% 64% 66% 68% 70%
## 29.0700 29.0700 29.0700 29.0700 29.0700 29.0700 30.0000 30.5000 32.0000
## 72% 74% 76% 78% 80% 82% 84% 86% 88%
## 32.5000 34.0000 35.0000 36.0000 37.0000 39.0000 40.0000 42.0000 44.0000
## 90% 92% 94% 96% 98% 100%
## 45.0000 47.0000 50.0000 52.0000 55.4400 60.0000
We can replace the outliers above 96% of the quantile range and below 4% of the quantile range so that more accuracy is obtained and the data loss is also not very significant.
data$age = ifelse(data$age>=52, 52, data$age)
data$age = ifelse(data$age<=4, 4, data$age)
boxplot(data$age)
The boxplot comes out to be neat in this case after outlier handling. Let us now do analysis for fare variable.
bx = boxplot(data$fare)
bx$stats
## [,1]
## [1,] 0.0000
## [2,] 7.8958
## [3,] 14.4000
## [4,] 31.0000
## [5,] 65.0000
Thus, there is a very large amount of outlier data on the upper end.
quantile(data$fare, seq(0, 1, 0.02))
## 0% 2% 4% 6% 8% 10%
## 0.000000 6.495800 7.125000 7.229200 7.250000 7.550000
## 12% 14% 16% 18% 20% 22%
## 7.740524 7.750000 7.750000 7.775000 7.854200 7.879200
## 24% 26% 28% 30% 32% 34%
## 7.895800 7.895800 7.925000 8.050000 8.050000 8.662500
## 36% 38% 40% 42% 44% 46%
## 9.034672 9.838676 10.500000 11.760000 13.000000 13.000000
## 48% 50% 52% 54% 56% 58%
## 13.000000 14.400000 15.045800 15.550000 16.100000 20.212500
## 60% 62% 64% 66% 68% 70%
## 21.045000 23.250000 25.895672 26.000000 26.020000 26.550000
## 72% 74% 76% 78% 80% 82%
## 27.900000 30.000000 31.387500 36.350000 39.687500 50.456136
## 84% 86% 88% 90% 92% 94%
## 53.176000 59.400000 69.550000 76.952520 83.158300 108.009000
## 96% 98% 100%
## 136.504184 211.500000 512.329200
As can be seen above, the major difference between the values arises above 96% of the quantile.
data$fare = ifelse(data$fare>=136, 136, data$fare)
boxplot(data$fare)
Bivariate analysis
Let us now start our bivariate analysis.
library(car)
## Loading required package: carData
scatterplot(data$age, data$survived)
It is to be noted that children and old passengers were saved first during the titanic mishap.
scatterplot(data$fare, data$survived)
Modeling Logistic regression using R
Let’s now divide the data into train and test sets.
set.seed(222)
t= sample(1:nrow(data), 0.7*nrow(data))
train = data[t,]
test = data[-t,]
Here, we have taken the train to test ratio as 7:3. Let’s now make a model and check for multi-collinearity using variance inflation factor technique.
library(car)
model <- lm(survived~., data=train)
t = vif(model)
sort(t, decreasing=TRUE)
## embarked_c embarked_s fare pclass parch sibsp
## 2.813592 2.687370 2.461220 2.372068 1.325070 1.321560
## age female
## 1.196365 1.120959
As you can see, all the values of VIF for all the variables are less than 5, we need not reject any varible and we can straight away start our analysis.
model1<- glm(as.factor(survived)~., family="binomial", data=train)
summary(model1)
##
## Call:
## glm(formula = as.factor(survived) ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4459 -0.6413 -0.4295 0.6550 2.5902
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.711050 0.685539 2.496 0.01256 *
## pclass -0.964057 0.162430 -5.935 2.93e-09 ***
## age -0.027928 0.008984 -3.109 0.00188 **
## sibsp -0.295096 0.113817 -2.593 0.00952 **
## parch -0.107448 0.112876 -0.952 0.34114
## fare 0.000588 0.004208 0.140 0.88887
## female 2.667516 0.198462 13.441 < 2e-16 ***
## embarked_c 0.511851 0.361509 1.416 0.15681
## embarked_s -0.348231 0.313073 -1.112 0.26601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1162.39 on 878 degrees of freedom
## Residual deviance: 789.76 on 870 degrees of freedom
## AIC: 807.76
##
## Number of Fisher Scoring iterations: 5
As you can see, for some variables like parch, embarked_c and embarked_s, the P value is greater than 0.05. Thus, here we cannot reject null hypothesis that there is no relation between survived and them. Thus, we need to accept the null hypothesis and discard these three variables from our analysis. Well, step function does it all for us.
stepmodel = step(model1, direction="both")
## Start: AIC=807.76
## as.factor(survived) ~ pclass + age + sibsp + parch + fare + female +
## embarked_c + embarked_s
##
## Df Deviance AIC
## - fare 1 789.77 805.77
## - parch 1 790.69 806.69
## - embarked_s 1 790.98 806.98
## 789.76 807.76
## - embarked_c 1 791.78 807.78
## - sibsp 1 797.53 813.53
## - age 1 799.66 815.66
## - pclass 1 826.11 842.11
## - female 1 1014.99 1030.99
##
## Step: AIC=805.77
## as.factor(survived) ~ pclass + age + sibsp + parch + female +
## embarked_c + embarked_s
##
## Df Deviance AIC
## - parch 1 790.72 804.72
## - embarked_s 1 791.05 805.05
## 789.77 805.77
## - embarked_c 1 791.85 805.85
## + fare 1 789.76 807.76
## - sibsp 1 798.04 812.04
## - age 1 799.72 813.72
## - pclass 1 858.03 872.03
## - female 1 1016.05 1030.05
##
## Step: AIC=804.72
## as.factor(survived) ~ pclass + age + sibsp + female + embarked_c +
## embarked_s
##
## Df Deviance AIC
## - embarked_s 1 792.33 804.33
## - embarked_c 1 792.52 804.52
## 790.72 804.72
## + parch 1 789.77 805.77
## + fare 1 790.69 806.69
## - age 1 800.80 812.80
## - sibsp 1 802.08 814.08
## - pclass 1 859.80 871.80
## - female 1 1022.31 1034.31
##
## Step: AIC=804.33
## as.factor(survived) ~ pclass + age + sibsp + female + embarked_c
##
## Df Deviance AIC
## 792.33 804.33
## + embarked_s 1 790.72 804.72
## + parch 1 791.05 805.05
## + fare 1 792.32 806.32
## - age 1 801.86 811.86
## - sibsp 1 805.17 815.17
## - embarked_c 1 806.55 816.55
## - pclass 1 860.48 870.48
## - female 1 1035.40 1045.40
formula(stepmodel)
## as.factor(survived) ~ pclass + age + sibsp + female + embarked_c
summary(stepmodel)
##
## Call:
## glm(formula = as.factor(survived) ~ pclass + age + sibsp + female +
## embarked_c, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4341 -0.6484 -0.4317 0.6472 2.5857
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.340344 0.459300 2.918 0.003520 **
## pclass -0.944959 0.119134 -7.932 2.16e-15 ***
## age -0.027296 0.008952 -3.049 0.002295 **
## sibsp -0.340258 0.105624 -3.221 0.001276 **
## female 2.661388 0.190473 13.973 < 2e-16 ***
## embarked_c 0.835246 0.221760 3.766 0.000166 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1162.39 on 878 degrees of freedom
## Residual deviance: 792.33 on 873 degrees of freedom
## AIC: 804.33
##
## Number of Fisher Scoring iterations: 5
Thus, now the main formula becomes as.factor(survived) ~ pclass + age + sibsp + female + embarked_c. Now, we can use stepmodel to predict the score for train dataset.
train$score <- predict(stepmodel, newdata = train, type="response")
head(train$score)
## [1] 0.09212068 0.31194682 0.14790780 0.90595397 0.10184237 0.18957629
tail(train$score)
## [1] 0.3448299 0.7656246 0.5763930 0.7224989 0.5082788 0.1043664
These are the probabilities values of whether the corresponding passenger survived or not.
Model evaluation for predictions
Let’s now start the model evaluation.
library(lattice)
library(ggplot2)
library(caret)
library(e1071)
train$prediction <- ifelse(train$score>=0.6, 1, 0)
table(factor(train$prediction), factor(train$survived))
##
## 0 1
## 0 508 132
## 1 42 197
Thus, accuracy = (TP+TN)/(TP+TN+FP+FN)=(508+197)/(508+197+42+132)=705/879=0.802=80.2%. Now, let’s check the ROC and AUC curves of the model.
library(InformationValue)
##
## Attaching package: 'InformationValue'
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, sensitivity, specificity
plotROC(actuals=train$survived, predictedScores=as.numeric(fitted(stepmodel)))
ks_plot(actuals=train$survived, predictedScores=as.numeric(fitted(stepmodel)))
Thus, the model has AUCRC value equal to 0.8456 which implies that the model quality is very good.
Predictions on test data and calculating accuracy
Now predict the scores on the test data.
test$score<-predict(stepmodel, test, type = "response")
head(test$score)
## [1] 0.3178353 0.9622005 0.9534058 0.5616826 0.2779192 0.9396649
Let’s set the threshold to 0.6.
test$predicted<-ifelse(test$score>=0.6, 1, 0)
head(test$predicted)
## [1] 0 1 1 0 0 1
Let’s now have a look at the confusion matrix for test dataset.
table(factor(test$predicted), factor(test$survived))
##
## 0 1
## 0 209 63
## 1 17 89
Thus, accuracy = 298/378 = 78.83%. Thus the model gives 80.02% accuracy on train data-set and 78.83% accuracy on test data-set.
Good Job done Jyoti..
Thanks Pratik. 🙂
Mam will you teach me..my background is mechanical and I m learning data science
Sure Nishkarsh. Mail us at info@mlforanalytics.com with what you want to learn and we can have a discussion thereafter.