
Logistic Regression using R: Titanic Case study

Hi MLEnthusiasts! Today, we will learn how to implement logistic regression in R, and on a well-known dataset at that: the Titanic dataset! Our analysis begins by getting some information about the dataset, such as which variables it contains and what we have to predict.

The model gives 80.2% accuracy on the train dataset and 78.83% accuracy on the test dataset.

The dataset can be found on Kaggle. Its variables are: pclass (passenger class), survived (0 = did not survive, 1 = survived), name, sex, age, sibsp (number of siblings/spouses aboard), parch (number of parents/children aboard), fare and embarked (port of embarkation: C, Q or S).

Having seen what the data is all about, let’s also understand the problem statement. The task is to build a logistic regression model using all the variables mentioned above, with Survived as the dependent variable and the rest as independent variables. It will therefore be a predictive model that predicts whether a passenger with the given attributes survives or not. Looking closely at the problem, we can see that it is a binary (0/1) classification problem, which is exactly what logistic regression is built for.
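Before jumping into the code, it helps to recall what logistic regression actually fits: a linear combination of the predictors passed through the logistic (inverse-logit) function, so the output is a probability between 0 and 1. A tiny illustration in R (plogis() is base R’s inverse-logit; the coefficients below are made up purely for illustration):

# Logistic model: P(survived = 1) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))
# plogis() is base R's inverse-logit; these coefficients are hypothetical
b0 <- 0.5; b_age <- -0.03; b_female <- 2.5
plogis(b0 + b_age*30 + b_female*1)  # predicted survival probability for a 30-year-old female
plogis(b0 + b_age*30 + b_female*0)  # predicted survival probability for a 30-year-old male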

Let us first set our working directory and import our dataset.

setwd("C:/Users/jyoti/Downloads/LogisticRegression")
data <- read.csv("titanic.csv")

Here, data is a dataframe containing all the variables and their values. It has 1257 observations of 9 variables. The next step is to view the data inside the dataframe.

View(data)

Now starts the first main step, “Data Preparation”. To check whether there is any missing data and to get a feel for each variable’s distribution (minimum, quartiles, mean, maximum), we use the summary() function.

summary(data)
##      pclass        survived                                    name     
##  Min.   :1.00   Min.   :0.0000   Connolly, Miss. Kate            :   2  
##  1st Qu.:2.00   1st Qu.:0.0000   Kelly, Mr. James                :   2  
##  Median :3.00   Median :0.0000   Abbing, Mr. Anthony             :   1  
##  Mean   :2.31   Mean   :0.3827   Abbott, Master. Eugene Joseph   :   1  
##  3rd Qu.:3.00   3rd Qu.:1.0000   Abbott, Mr. Rossmore Edward     :   1  
##  Max.   :3.00   Max.   :1.0000   Abbott, Mrs. Stanton (Rosa Hunt):   1  
##                                  (Other)                         :1249  
##      sex           age            sibsp           parch       
##  female:452   Min.   : 1.00   Min.   :0.000   Min.   :0.0000  
##  male  :805   1st Qu.:21.00   1st Qu.:0.000   1st Qu.:0.0000  
##               Median :28.00   Median :0.000   Median :0.0000  
##               Mean   :29.07   Mean   :0.502   Mean   :0.3779  
##               3rd Qu.:37.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##               Max.   :60.00   Max.   :8.000   Max.   :9.0000  
##               NA's   :261                                     
##       fare         embarked
##  Min.   :  0.000   C:256   
##  1st Qu.:  7.896   Q:119   
##  Median : 14.400   S:882   
##  Mean   : 32.721           
##  3rd Qu.: 31.000           
##  Max.   :512.329           
## 

As can be seen, there are 261 missing values in the age variable, so we need to do missing-value imputation. But before doing that, we need to check what the age distribution looks like, so that we know which imputation method to choose and apply.
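Before plotting, a quick way to confirm which columns actually contain missing values (a small check, not part of the original workflow) is:

# Count missing values per column; only age should show NAs here
colSums(is.na(data))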

hist(data$age)


Since the distribution looks roughly normal, we can use mean-value imputation here, i.e. replace the missing values with the mean age. This keeps the mean unchanged and leaves the shape of the age distribution largely intact.

data$age[is.na(data$age)] = 29.07  # 29.07 is the mean age reported by summary(data) above
summary(data)
##      pclass        survived                                    name     
##  Min.   :1.00   Min.   :0.0000   Connolly, Miss. Kate            :   2  
##  1st Qu.:2.00   1st Qu.:0.0000   Kelly, Mr. James                :   2  
##  Median :3.00   Median :0.0000   Abbing, Mr. Anthony             :   1  
##  Mean   :2.31   Mean   :0.3827   Abbott, Master. Eugene Joseph   :   1  
##  3rd Qu.:3.00   3rd Qu.:1.0000   Abbott, Mr. Rossmore Edward     :   1  
##  Max.   :3.00   Max.   :1.0000   Abbott, Mrs. Stanton (Rosa Hunt):   1  
##                                  (Other)                         :1249  
##      sex           age            sibsp           parch       
##  female:452   Min.   : 1.00   Min.   :0.000   Min.   :0.0000  
##  male  :805   1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000  
##               Median :29.07   Median :0.000   Median :0.0000  
##               Mean   :29.07   Mean   :0.502   Mean   :0.3779  
##               3rd Qu.:34.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##               Max.   :60.00   Max.   :8.000   Max.   :9.0000  
##                                                               
##       fare         embarked
##  Min.   :  0.000   C:256   
##  1st Qu.:  7.896   Q:119   
##  Median : 14.400   S:882   
##  Mean   : 32.721           
##  3rd Qu.: 31.000           
##  Max.   :512.329           
## 

As can be seen above, age no longer has any missing values. Let’s see what the data looks like now.

head(data)
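A side note on the imputation above: the value 29.07 was typed in by hand from the summary() output. The same step can be written without hardcoding the mean; this sketch is equivalent to the line used earlier and would replace it (do not run both):

# Equivalent mean imputation, computing the mean on the fly instead of hardcoding 29.07
data$age[is.na(data$age)] <- mean(data$age, na.rm = TRUE)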

Now, let us understand the concept of dummy variables. Suppose a variable A has n classes; it can then be replaced by n - 1 indicator variables. If A takes the values i, j, k, ..., then A_i = 1 in the rows where A equals i and 0 in the rest, and similarly for j, k and so on. The remaining level is taken care of by the intercept. So, let’s introduce dummy variables for the sex and embarked columns, since they hold categorical data.

data$female = ifelse(data$sex=="female", 1, 0)
data$embarked_c = ifelse(data$embarked=="C", 1, 0)
data$embarked_s = ifelse(data$embarked=="S", 1, 0)
head(data)

Now, if you have a look at the dataframe, it contains 12 variables instead of 9. The next step is to remove the variables we no longer need: name, sex (already captured by the female variable) and embarked (captured by embarked_c and embarked_s), i.e. column numbers 3, 4 and 9.

data = data[-c(3, 4, 9)]
head(data)
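As an aside, building dummies by hand is optional: R’s modelling functions expand factor columns into indicator variables automatically, dropping one level as the reference exactly as described above. A minimal sketch of what that expansion looks like, using a toy factor:

# model.matrix() shows the dummy columns R creates behind the scenes;
# the level "C" is absorbed into the intercept as the reference level
port <- factor(c("C", "Q", "S", "C"))
model.matrix(~ port)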

Let’s do univariate analysis of the numerical variables, age and fare now.

bx = boxplot(data$age)


Thus, there are outliers in the age variable and we need to do outlier handling in this case.

bx$stats
##       [,1]
## [1,]  4.00
## [2,] 22.00
## [3,] 29.07
## [4,] 34.00
## [5,] 52.00
quantile(data$age, seq(0, 1, 0.02))
##      0%      2%      4%      6%      8%     10%     12%     14%     16% 
##  1.0000  3.0000  7.0000 11.0000 15.0000 17.0000 18.0000 18.0000 19.0000 
##     18%     20%     22%     24%     26%     28%     30%     32%     34% 
## 20.0000 21.0000 21.3200 22.0000 23.0000 24.0000 24.0000 25.0000 25.0000 
##     36%     38%     40%     42%     44%     46%     48%     50%     52% 
## 26.0000 27.0000 28.0000 29.0000 29.0448 29.0700 29.0700 29.0700 29.0700 
##     54%     56%     58%     60%     62%     64%     66%     68%     70% 
## 29.0700 29.0700 29.0700 29.0700 29.0700 29.0700 30.0000 30.5000 32.0000 
##     72%     74%     76%     78%     80%     82%     84%     86%     88% 
## 32.5000 34.0000 35.0000 36.0000 37.0000 39.0000 40.0000 42.0000 44.0000 
##     90%     92%     94%     96%     98%    100% 
## 45.0000 47.0000 50.0000 52.0000 55.4400 60.0000

We can cap the outliers at roughly the 96th percentile on the upper end and at the lower boxplot whisker on the lower end (52 and 4, the whisker limits shown in bx$stats above), so that accuracy improves while the loss of information stays small.
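For reference, since the cut-off values used below (52 and 4) are exactly the whisker limits stored in bx$stats, the same capping can be written in one step; this sketch is equivalent to the two ifelse() lines that follow and is shown only as an alternative:

# Equivalent capping using the boxplot whiskers stored in bx$stats
caps <- bx$stats[c(1, 5), 1]                         # 4 and 52 for this data
data$age <- pmin(pmax(data$age, caps[1]), caps[2])   # clamp age to [4, 52]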

data$age = ifelse(data$age>=52, 52, data$age)  # cap at the upper whisker (~96th percentile)
data$age = ifelse(data$age<=4, 4, data$age)    # cap at the lower whisker
boxplot(data$age)


The boxplot looks much cleaner after outlier handling. Let us now do the same analysis for the fare variable.

bx = boxplot(data$fare)

bx$stats
##         [,1]
## [1,]  0.0000
## [2,]  7.8958
## [3,] 14.4000
## [4,] 31.0000
## [5,] 65.0000

Thus, there is a very large amount of outlier data on the upper end.

quantile(data$fare, seq(0, 1, 0.02))
##         0%         2%         4%         6%         8%        10% 
##   0.000000   6.495800   7.125000   7.229200   7.250000   7.550000 
##        12%        14%        16%        18%        20%        22% 
##   7.740524   7.750000   7.750000   7.775000   7.854200   7.879200 
##        24%        26%        28%        30%        32%        34% 
##   7.895800   7.895800   7.925000   8.050000   8.050000   8.662500 
##        36%        38%        40%        42%        44%        46% 
##   9.034672   9.838676  10.500000  11.760000  13.000000  13.000000 
##        48%        50%        52%        54%        56%        58% 
##  13.000000  14.400000  15.045800  15.550000  16.100000  20.212500 
##        60%        62%        64%        66%        68%        70% 
##  21.045000  23.250000  25.895672  26.000000  26.020000  26.550000 
##        72%        74%        76%        78%        80%        82% 
##  27.900000  30.000000  31.387500  36.350000  39.687500  50.456136 
##        84%        86%        88%        90%        92%        94% 
##  53.176000  59.400000  69.550000  76.952520  83.158300 108.009000 
##        96%        98%       100% 
## 136.504184 211.500000 512.329200

As can be seen above, the values jump sharply once we go above the 96th percentile.

data$fare = ifelse(data$fare>=136, 136, data$fare)  # cap fare near its 96th percentile (~136.5)
boxplot(data$fare)


Let us now start our bivariate analysis.

library(car)
## Loading required package: carData
scatterplot(data$age, data$survived)


It is worth noting that children and elderly passengers were given priority in the rescue during the Titanic disaster, which is reflected in this plot.

scatterplot(data$fare, data$survived)


Let’s now divide the data into train and test sets.

set.seed(222)
t= sample(1:nrow(data), 0.7*nrow(data))
train = data[t,]
test = data[-t,]

Here, we have taken the train-to-test ratio as 7:3. Let’s now fit a quick linear model and check for multicollinearity using the variance inflation factor (VIF).

library(car)
model <- lm(survived~., data=train)
t = vif(model)
sort(t, decreasing=TRUE)
## embarked_c embarked_s       fare     pclass      parch      sibsp 
##   2.813592   2.687370   2.461220   2.372068   1.325070   1.321560 
##        age     female 
##   1.196365   1.120959

As you can see, the VIF values for all the variables are less than 5, so we need not reject any variable and can go straight into building the logistic regression model.
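For intuition, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. A quick manual check for one variable (a sketch, not part of the original workflow):

# VIF of fare by hand: regress fare on the other predictors (response excluded)
r2_fare <- summary(lm(fare ~ . - survived, data = train))$r.squared
1 / (1 - r2_fare)   # should roughly match vif(model)["fare"] above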

model1<- glm(as.factor(survived)~., family="binomial", data=train)
summary(model1)
## 
## Call:
## glm(formula = as.factor(survived) ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4459  -0.6413  -0.4295   0.6550   2.5902  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.711050   0.685539   2.496  0.01256 *  
## pclass      -0.964057   0.162430  -5.935 2.93e-09 ***
## age         -0.027928   0.008984  -3.109  0.00188 ** 
## sibsp       -0.295096   0.113817  -2.593  0.00952 ** 
## parch       -0.107448   0.112876  -0.952  0.34114    
## fare         0.000588   0.004208   0.140  0.88887    
## female       2.667516   0.198462  13.441  < 2e-16 ***
## embarked_c   0.511851   0.361509   1.416  0.15681    
## embarked_s  -0.348231   0.313073  -1.112  0.26601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1162.39  on 878  degrees of freedom
## Residual deviance:  789.76  on 870  degrees of freedom
## AIC: 807.76
## 
## Number of Fisher Scoring iterations: 5

As you can see, for some variables, such as parch, fare, embarked_c and embarked_s, the p-value is greater than 0.05. For these we fail to reject the null hypothesis that there is no relation between survived and the variable, so they are candidates for removal from the model. Conveniently, the step() function does this selection for us (using AIC).

stepmodel = step(model1, direction="both")
## Start:  AIC=807.76
## as.factor(survived) ~ pclass + age + sibsp + parch + fare + female + 
##     embarked_c + embarked_s
## 
##              Df Deviance     AIC
## - fare        1   789.77  805.77
## - parch       1   790.69  806.69
## - embarked_s  1   790.98  806.98
## <none>            789.76  807.76
## - embarked_c  1   791.78  807.78
## - sibsp       1   797.53  813.53
## - age         1   799.66  815.66
## - pclass      1   826.11  842.11
## - female      1  1014.99 1030.99
## 
## Step:  AIC=805.77
## as.factor(survived) ~ pclass + age + sibsp + parch + female + 
##     embarked_c + embarked_s
## 
##              Df Deviance     AIC
## - parch       1   790.72  804.72
## - embarked_s  1   791.05  805.05
## <none>            789.77  805.77
## - embarked_c  1   791.85  805.85
## + fare        1   789.76  807.76
## - sibsp       1   798.04  812.04
## - age         1   799.72  813.72
## - pclass      1   858.03  872.03
## - female      1  1016.05 1030.05
## 
## Step:  AIC=804.72
## as.factor(survived) ~ pclass + age + sibsp + female + embarked_c + 
##     embarked_s
## 
##              Df Deviance     AIC
## - embarked_s  1   792.33  804.33
## - embarked_c  1   792.52  804.52
## <none>            790.72  804.72
## + parch       1   789.77  805.77
## + fare        1   790.69  806.69
## - age         1   800.80  812.80
## - sibsp       1   802.08  814.08
## - pclass      1   859.80  871.80
## - female      1  1022.31 1034.31
## 
## Step:  AIC=804.33
## as.factor(survived) ~ pclass + age + sibsp + female + embarked_c
## 
##              Df Deviance     AIC
## <none>            792.33  804.33
## + embarked_s  1   790.72  804.72
## + parch       1   791.05  805.05
## + fare        1   792.32  806.32
## - age         1   801.86  811.86
## - sibsp       1   805.17  815.17
## - embarked_c  1   806.55  816.55
## - pclass      1   860.48  870.48
## - female      1  1035.40 1045.40
formula(stepmodel)
## as.factor(survived) ~ pclass + age + sibsp + female + embarked_c
summary(stepmodel)
## 
## Call:
## glm(formula = as.factor(survived) ~ pclass + age + sibsp + female + 
##     embarked_c, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4341  -0.6484  -0.4317   0.6472   2.5857  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.340344   0.459300   2.918 0.003520 ** 
## pclass      -0.944959   0.119134  -7.932 2.16e-15 ***
## age         -0.027296   0.008952  -3.049 0.002295 ** 
## sibsp       -0.340258   0.105624  -3.221 0.001276 ** 
## female       2.661388   0.190473  13.973  < 2e-16 ***
## embarked_c   0.835246   0.221760   3.766 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1162.39  on 878  degrees of freedom
## Residual deviance:  792.33  on 873  degrees of freedom
## AIC: 804.33
## 
## Number of Fisher Scoring iterations: 5

Thus, the final formula becomes as.factor(survived) ~ pclass + age + sibsp + female + embarked_c. Now we can use stepmodel to predict the score on the train dataset.

train$score <- predict(stepmodel, newdata = train, type="response")
head(train$score)
## [1] 0.09212068 0.31194682 0.14790780 0.90595397 0.10184237 0.18957629
tail(train$score)
## [1] 0.3448299 0.7656246 0.5763930 0.7224989 0.5082788 0.1043664

These are the predicted probabilities that the corresponding passengers survived. Let’s now start the model evaluation.

library(lattice)
library(ggplot2)
library(caret)
library(e1071)
train$prediction <- ifelse(train$score>=0.6, 1, 0)  # classify as survived when the predicted probability is at least 0.6
table(factor(train$prediction), factor(train$survived))
##    
##       0   1
##   0 508 132
##   1  42 197
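The accuracy can also be read off this table programmatically, equivalent to the hand calculation in the next paragraph (a small base-R sketch):

# Accuracy = correctly classified / total; diag() picks the diagonal counts
cm <- table(factor(train$prediction), factor(train$survived))
sum(diag(cm)) / sum(cm)   # about 0.802 on this split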

Thus, accuracy = (TP + TN) / (TP + TN + FP + FN) = (508 + 197) / (508 + 197 + 42 + 132) = 705/879 ≈ 0.802, i.e. 80.2%. Now, let’s look at the ROC curve and the KS chart of the model.

library(InformationValue)
## 
## Attaching package: 'InformationValue'
## The following objects are masked from 'package:caret':
## 
##     confusionMatrix, precision, sensitivity, specificity
plotROC(actuals=train$survived, predictedScores=as.numeric(fitted(stepmodel)))

ks_plot(actuals=train$survived, predictedScores=as.numeric(fitted(stepmodel)))
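If you want the AUROC as a plain number rather than reading it off the plot, the same package exposes it (assuming the installed version of InformationValue provides AUROC(); this call is a sketch, and pROC::auc() is an alternative):

# AUROC as a single number (assumes InformationValue::AUROC is available)
AUROC(actuals = train$survived, predictedScores = as.numeric(fitted(stepmodel)))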


Thus, the model has an AUROC of 0.8456, which implies that the model quality is quite good. Now let’s predict the scores on the test data.

test$score<-predict(stepmodel, test, type = "response")
head(test$score)
## [1] 0.3178353 0.9622005 0.9534058 0.5616826 0.2779192 0.9396649

Let’s use the same threshold of 0.6 that we used on the train set.

test$predicted<-ifelse(test$score>=0.6, 1, 0)
head(test$predicted)
## [1] 0 1 1 0 0 1

Let’s now have a look at the confusion matrix for test dataset.

table(factor(test$predicted), factor(test$survived))
##    
##       0   1
##   0 209  63
##   1  17  89

Thus, accuracy = (209 + 89) / 378 = 298/378 ≈ 78.83%. So the model gives 80.2% accuracy on the train dataset and 78.83% accuracy on the test dataset.
