
Random Forests Using R: Titanic Case Study

Hi MLEnthusiasts! Today, we will learn how to implement random forests using R, and that too on a well-known dataset: the Titanic dataset! Our analysis begins by getting some information about the data, such as what variables it contains and what we have to predict. This tutorial will also show you how to follow Kaggle’s guidelines and how to make submissions on Kaggle. The train and test data files can be found on this link.

The dataset can be found at this link on Kaggle. The following are the variables of this dataset:

- survival: whether a particular passenger survived or not; 0 for not survived, 1 for survived.
- pclass: ticket class; 1 for 1st class, 2 for 2nd class and 3 for 3rd class.
- sex: the gender of the passenger.
- age: age in years.
- sibsp: # of siblings or spouses aboard the Titanic.
- parch: # of parents or children of the passenger aboard the Titanic.
- fare: the passenger fare.
- embarked: the port of embarkation; C for Cherbourg, Q for Queenstown and S for Southampton.

Having seen what the data is all about, let’s also understand the problem statement. The task is to build a predictive model which predicts whether a passenger with the given parameters survived or not. Looking closely at the problem, we can say that it’s a binary classification problem (0/1), which we will try to solve using random forests.

Let us first import our dataset (adjust the path to wherever you saved the files).

data <- read.csv("C:/Users/jyoti/Downloads/randomForests/train.csv")
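If you prefer to set the working directory first, an equivalent sketch (reusing the same folder as above) is:

setwd("C:/Users/jyoti/Downloads/randomForests")   # make the data folder the working directory
data <- read.csv("train.csv")                     # a relative path now suffices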

Here, data is a dataframe holding all the variables and their values; it has 891 observations of 12 variables. The next step is to view the data inside the dataframe.

View(data)

Now starts the first main step, data preparation. To check for missing data and to get summary statistics such as the mean and quartiles of each variable, we use the summary() function.

summary(data)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

As can be seen, there are 177 missing values in the Age variable, so we need to do missing-value imputation. But before doing that, we need to check what the age distribution looks like, so that we know which imputation method to choose and apply.

hist(data$Age)
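To judge the shape a little more carefully before imputing, we can draw the density histogram with a normal curve overlaid (a small sketch; the plot labels are our own choice):

hist(data$Age, freq = FALSE, main = "Age distribution", xlab = "Age")
curve(dnorm(x, mean = mean(data$Age, na.rm = TRUE),
            sd = sd(data$Age, na.rm = TRUE)), add = TRUE)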


Since the distribution looks roughly normal, we can use mean-value imputation in this case. That is, we can replace the missing values with the mean of the age. This keeps the mean essentially unchanged, at the cost of concentrating many observations at a single value.

data$Age[is.na(data$Age)] = 29.07
summary(data)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:22.00  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :29.07  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.57  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:35.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                               
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

As can be seen above, Age doesn’t have any missing values now. Let’s see what the data looks like now.

head(data)
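As an aside, the imputation value need not be hard-coded; it can be computed from the data directly (a small sketch, with na.rm = TRUE so the missing entries are ignored when taking the mean):

mean_age <- mean(data$Age, na.rm = TRUE)   # ~29.7 on the raw training data
data$Age[is.na(data$Age)] <- mean_age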

Now, let us understand the concept of dummy variables. Suppose a variable A has n classes. A can then be replaced by n-1 indicator variables: if A has classes i, j, k, …, then A_i = 1 in the rows where i appears in A’s column and 0 in the rest, and similarly for j, k and so on. The last class is taken care of by the intercept. So, let’s introduce dummy variables into our data for the Sex and Embarked columns, since they hold categorical data.

data$female = ifelse(data$Sex=="female", 1, 0)
data$embarked_c = ifelse(data$Embarked=="C", 1, 0)
data$embarked_s = ifelse(data$Embarked=="S", 1, 0)
head(data)
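Equivalently, base R can generate such dummies automatically: model.matrix drops the first level of each factor as the baseline (a sketch):

dummies <- model.matrix(~ Sex + Embarked, data = data)   # intercept plus one column per non-baseline level
head(dummies)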

Now, if you have a look at the dataframe, it contains 15 variables instead of 12. The next step is to remove the variables which we no longer need for model building: Name; Sex, since it is already captured by the female variable; Ticket; Cabin; and Embarked, i.e. column numbers 4, 5, 9, 11 and 12.

PassengerData = data[-c(4, 5, 9, 11, 12)]
head(PassengerData)

Let’s now do univariate analysis of the numerical variables, Age and Fare.

bx = boxplot(PassengerData$Age)


Thus, there are outliers in the Age variable, and we need to handle them in this case.

bx$stats
##       [,1]
## [1,]  3.00
## [2,] 22.00
## [3,] 29.07
## [4,] 35.00
## [5,] 54.00
quantile(PassengerData$Age, seq(0, 1, 0.02))
##    0%    2%    4%    6%    8%   10%   12%   14%   16%   18%   20%   22% 
##  0.42  2.00  4.00  8.40 14.00 16.00 17.00 18.00 19.00 19.00 20.00 21.00 
##   24%   26%   28%   30%   32%   34%   36%   38%   40%   42%   44%   46% 
## 22.00 23.00 24.00 24.00 25.00 26.00 27.00 28.00 28.00 29.00 29.07 29.07 
##   48%   50%   52%   54%   56%   58%   60%   62%   64%   66%   68%   70% 
## 29.07 29.07 29.07 29.07 29.07 29.07 29.07 29.07 30.00 30.70 32.00 32.50 
##   72%   74%   76%   78%   80%   82%   84%   86%   88%   90%   92%   94% 
## 34.00 35.00 36.00 36.00 38.00 40.00 41.00 43.00 45.00 47.00 50.00 52.00 
##   96%   98%  100% 
## 56.00 62.00 80.00

We can cap the outliers above roughly the 96th percentile and below roughly the 4th percentile of the variable; this reins in the extreme values while the data loss stays insignificant.

PassengerData$Age = ifelse(PassengerData$Age>=56, 56, PassengerData$Age)
PassengerData$Age = ifelse(PassengerData$Age<=3, 3, PassengerData$Age)
boxplot(PassengerData$Age)
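The same capping can be wrapped in a small reusable helper (a minimal sketch; the cap_quantiles name and the 4%/96% defaults are our own choices, not part of the original code):

# Cap a numeric vector at the given lower/upper quantiles
cap_quantiles <- function(x, lower = 0.04, upper = 0.96) {
  bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[1]), bounds[2])
}
# e.g. PassengerData$Age <- cap_quantiles(PassengerData$Age)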


After outlier handling, the boxplot looks clean. Let us now do the same analysis for the Fare variable.

bx = boxplot(PassengerData$Fare)

bx$stats
##         [,1]
## [1,]  0.0000
## [2,]  7.9104
## [3,] 14.4542
## [4,] 31.0000
## [5,] 65.0000

Thus, there are a large number of outliers on the upper end.

quantile(PassengerData$Fare, seq(0, 1, 0.02))
##        0%        2%        4%        6%        8%       10%       12% 
##   0.00000   6.39750   7.05252   7.22500   7.25000   7.55000   7.75000 
##       14%       16%       18%       20%       22%       24%       26% 
##   7.75000   7.77500   7.79580   7.85420   7.89580   7.89580   7.92500 
##       28%       30%       32%       34%       36%       38%       40% 
##   8.05000   8.05000   8.10000   8.66250   9.50000  10.47000  10.50000 
##       42%       44%       46%       48%       50%       52%       54% 
##  12.22000  13.00000  13.00000  13.08334  14.45420  15.24580  15.85000 
##       56%       58%       60%       62%       64%       66%       68% 
##  17.88000  20.22000  21.67920  24.15000  26.00000  26.00000  26.30750 
##       70%       72%       74%       76%       78%       80%       82% 
##  27.00000  28.94250  30.32832  31.38750  35.50000  39.68750  49.90084 
##       84%       86%       88%       90%       92%       94%       96% 
##  53.10000  57.39168  69.55000  77.95830  82.17080  93.50000 133.99000 
##       98%      100% 
## 211.33750 512.32920

As can be seen above, the values jump sharply above the 96th percentile.

PassengerData$Fare = ifelse(PassengerData$Fare>=133.99, 133.99, PassengerData$Fare)
boxplot(PassengerData$Fare)


Let us now start our bivariate analysis.

library(car)
## Loading required package: carData
scatterplot(PassengerData$Age, PassengerData$Survived)
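A grouped boxplot conveys the same relationship and is often easier to read for a binary outcome (a sketch):

boxplot(Age ~ Survived, data = PassengerData,
        xlab = "Survived", ylab = "Age")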


The plot reflects that children and elderly passengers were more likely to be saved during the Titanic disaster.

scatterplot(PassengerData$Fare, PassengerData$Survived)


Now, let’s load the libraries needed for our random forest model.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(e1071)

The next step is to build the random forest model. Setting do.trace = T prints the OOB error (overall and per class) after each tree is grown.

modelrf <- randomForest(as.factor(PassengerData$Survived)~., data = PassengerData, do.trace = T)
## ntree      OOB      1      2
##     1:  26.48% 19.50% 38.02%
##     2:  23.88% 16.41% 35.75%
##     3:  22.86% 14.29% 36.29%
##     4:  22.65% 14.10% 36.05%
##     5:  21.84% 13.82% 34.39%
##     6:  22.28% 14.99% 33.54%
##     7:  22.72% 15.68% 33.84%
##     8:  21.31% 12.59% 35.12%
##     9:  21.26% 13.94% 32.94%
##    10:  20.75% 13.63% 32.15%
##   ...  (trees 11 through 497 omitted; the OOB error gradually settles around 17-18%)
##   498:  17.85% 10.20% 30.12%
##   499:  17.73% 10.02% 30.12%
##   500:  17.73% 10.02% 30.12%

Let’s now call our model to see what’s in it.

modelrf
## 
## Call:
##  randomForest(formula = as.factor(PassengerData$Survived) ~ .,      data = PassengerData, do.trace = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 17.73%
## Confusion matrix:
##     0   1 class.error
## 0 494  55   0.1001821
## 1 103 239   0.3011696

As the summary also shows, 500 trees (the default) have been grown to make this forest. Let’s find out the importance of each of the variables.

importance(modelrf)
##             MeanDecreaseGini
## PassengerId        62.139964
## Pclass             35.889853
## Age                58.417328
## SibSp              16.154976
## Parch              12.207239
## Fare               71.423022
## female            106.250916
## embarked_c          6.026943
## embarked_s          7.076869
varImpPlot(modelrf)


As can be seen, the female variable is by far the most important one for us. Let’s now make predictions on the train dataset.

predictTr <- predict(modelrf, PassengerData)
table(predictTr, PassengerData$Survived)
##          
## predictTr   0   1
##         0 546  29
##         1   3 313

Thus, the accuracy on the train dataset is (546+313)/(546+29+3+313) = 859/891 ≈ 0.9641. Note that this in-sample accuracy is optimistic; the OOB error above (17.73%) is a more honest estimate of generalization performance.
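The same number can be computed directly from the confusion matrix, and, since pROC is already loaded, we can also compute a training AUC from the predicted class probabilities (a small sketch):

tab <- table(predictTr, PassengerData$Survived)
sum(diag(tab)) / sum(tab)                                      # training accuracy, ~0.964

probs <- predict(modelrf, PassengerData, type = "prob")[, 2]   # P(Survived = 1)
auc(roc(PassengerData$Survived, probs))                        # training AUC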

test <- read.csv("C:/Users/jyoti/Downloads/randomForests/test.csv")
summary(test)
##   PassengerId         Pclass     
##  Min.   : 892.0   Min.   :1.000  
##  1st Qu.: 996.2   1st Qu.:1.000  
##  Median :1100.5   Median :3.000  
##  Mean   :1100.5   Mean   :2.266  
##  3rd Qu.:1204.8   3rd Qu.:3.000  
##  Max.   :1309.0   Max.   :3.000  
##                                  
##                                         Name         Sex     
##  Abbott, Master. Eugene Joseph            :  1   female:152  
##  Abelseth, Miss. Karen Marie              :  1   male  :266  
##  Abelseth, Mr. Olaus Jorgensen            :  1               
##  Abrahamsson, Mr. Abraham August Johannes :  1               
##  Abrahim, Mrs. Joseph (Sophie Halaut Easu):  1               
##  Aks, Master. Philip Frank                :  1               
##  (Other)                                  :412               
##       Age            SibSp            Parch             Ticket   
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   PC 17608:  5  
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   113503  :  4  
##  Median :27.00   Median :0.0000   Median :0.0000   CA. 2343:  4  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923   16966   :  3  
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000   220845  :  3  
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000   347077  :  3  
##  NA's   :86                                        (Other) :396  
##       Fare                     Cabin     Embarked
##  Min.   :  0.000                  :327   C:102   
##  1st Qu.:  7.896   B57 B59 B63 B66:  3   Q: 46   
##  Median : 14.454   A34            :  2   S:270   
##  Mean   : 35.627   B45            :  2           
##  3rd Qu.: 31.500   C101           :  2           
##  Max.   :512.329   C116           :  2           
##  NA's   :1         (Other)        : 80

Since there are missing values in the test dataset too, we will follow the same series of steps as we did for the train data.

hist(test$Age)


Let’s replace the missing values with the mean, as the distribution is more or less normal in nature.

test$Age[is.na(test$Age)]=30.27
summary(test)
##   PassengerId         Pclass     
##  Min.   : 892.0   Min.   :1.000  
##  1st Qu.: 996.2   1st Qu.:1.000  
##  Median :1100.5   Median :3.000  
##  Mean   :1100.5   Mean   :2.266  
##  3rd Qu.:1204.8   3rd Qu.:3.000  
##  Max.   :1309.0   Max.   :3.000  
##                                  
##                                         Name         Sex     
##  Abbott, Master. Eugene Joseph            :  1   female:152  
##  Abelseth, Miss. Karen Marie              :  1   male  :266  
##  Abelseth, Mr. Olaus Jorgensen            :  1               
##  Abrahamsson, Mr. Abraham August Johannes :  1               
##  Abrahim, Mrs. Joseph (Sophie Halaut Easu):  1               
##  Aks, Master. Philip Frank                :  1               
##  (Other)                                  :412               
##       Age            SibSp            Parch             Ticket   
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   PC 17608:  5  
##  1st Qu.:23.00   1st Qu.:0.0000   1st Qu.:0.0000   113503  :  4  
##  Median :30.27   Median :0.0000   Median :0.0000   CA. 2343:  4  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923   16966   :  3  
##  3rd Qu.:35.75   3rd Qu.:1.0000   3rd Qu.:0.0000   220845  :  3  
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000   347077  :  3  
##                                                    (Other) :396  
##       Fare                     Cabin     Embarked
##  Min.   :  0.000                  :327   C:102   
##  1st Qu.:  7.896   B57 B59 B63 B66:  3   Q: 46   
##  Median : 14.454   A34            :  2   S:270   
##  Mean   : 35.627   B45            :  2           
##  3rd Qu.: 31.500   C101           :  2           
##  Max.   :512.329   C116           :  2           
##  NA's   :1         (Other)        : 80

There is one missing value in Fare too.

hist(test$Fare)


Since the variable is skewed, let’s replace the missing value with the median (14.454), which is less affected by the extreme fares than the mean.

test$Fare[is.na(test$Fare)] = 14.454
summary(test)
##   PassengerId         Pclass     
##  Min.   : 892.0   Min.   :1.000  
##  1st Qu.: 996.2   1st Qu.:1.000  
##  Median :1100.5   Median :3.000  
##  Mean   :1100.5   Mean   :2.266  
##  3rd Qu.:1204.8   3rd Qu.:3.000  
##  Max.   :1309.0   Max.   :3.000  
##                                  
##                                         Name         Sex     
##  Abbott, Master. Eugene Joseph            :  1   female:152  
##  Abelseth, Miss. Karen Marie              :  1   male  :266  
##  Abelseth, Mr. Olaus Jorgensen            :  1               
##  Abrahamsson, Mr. Abraham August Johannes :  1               
##  Abrahim, Mrs. Joseph (Sophie Halaut Easu):  1               
##  Aks, Master. Philip Frank                :  1               
##  (Other)                                  :412               
##       Age            SibSp            Parch             Ticket   
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   PC 17608:  5  
##  1st Qu.:23.00   1st Qu.:0.0000   1st Qu.:0.0000   113503  :  4  
##  Median :30.27   Median :0.0000   Median :0.0000   CA. 2343:  4  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923   16966   :  3  
##  3rd Qu.:35.75   3rd Qu.:1.0000   3rd Qu.:0.0000   220845  :  3  
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000   347077  :  3  
##                                                    (Other) :396  
##       Fare                     Cabin     Embarked
##  Min.   :  0.000                  :327   C:102   
##  1st Qu.:  7.896   B57 B59 B63 B66:  3   Q: 46   
##  Median : 14.454   A34            :  2   S:270   
##  Mean   : 35.577   B45            :  2           
##  3rd Qu.: 31.472   C101           :  2           
##  Max.   :512.329   C116           :  2           
##                    (Other)        : 80

Let’s now do the same feature engineering on the test set.

test$female = ifelse(test$Sex=="female", 1, 0)
test$embarked_c = ifelse(test$Embarked=="C", 1, 0)
test$embarked_s = ifelse(test$Embarked=="S", 1, 0)
head(test)

Let’s remove the Name, Sex, Ticket, Cabin and Embarked variables as we did for the training set; the column numbers differ from before because the test set has no Survived column.

newtest = data.frame(test)
newtest = newtest[-c(3, 4, 8, 10, 11)]
head(newtest)

Let’s now do the predictions.
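Before predicting, it is worth sanity-checking that the test frame carries the same predictor columns the model was trained on (a small sketch; it should report only Survived as missing, which predict does not need):

setdiff(names(PassengerData), names(newtest))   # columns the model saw but newtest lacks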

newtest$predicted<-predict(modelrf, newtest)
head(newtest$predicted)
## [1] 0 0 0 0 0 0
## Levels: 0 1

Let’s save the passengerId variable and predictions in a new dataframe.

submission<- data.frame(matrix(nrow=nrow(test)))
submission$PassengerId <- newtest$PassengerId
submission$Survived <- newtest$predicted
submission <- submission[-c(1)]
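The same dataframe can be built in a single step, avoiding the placeholder matrix column (an equivalent sketch):

submission <- data.frame(PassengerId = newtest$PassengerId,
                         Survived = newtest$predicted)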

Let’s now save the dataframe as a CSV file; a PassengerId column plus a Survived column is exactly the format Kaggle expects for this competition.

write.csv(submission, 'submission.csv', row.names = FALSE)

2 thoughts on “Random Forests Using R: Titanic Case Study”

  1. Thanks Jyoti for sharing.
    How can we proceed with data preprocessing when there are 1000 variables? What steps need to be followed to manage outliers?

    1. Welcome Ash! The best way to manage outliers is to look at the quantiles and box plot of each variable to see whether there are outliers, and then replace them with the value corresponding to, say, the 96th percentile. You can also opt for regularization, as it minimizes the effect of outliers. I have made two posts on L2 and L1 regularization; please go through them, and you will be able to handle outliers in deep learning implementations as well.
