Random Forests using R: Titanic Case Study
Random Forests using R: Introduction
Hi MLEnthusiasts! Today, we will learn how to implement random forests using R, and on a well-known dataset at that: the Titanic dataset! Our analysis involves getting some information about the dataset, such as which variables it contains and what we have to predict. This tutorial will also show you how to follow Kaggle's guidelines and how to make submissions on Kaggle.
The Data set
The train and test data files can be found on this Kaggle link. Following are the variables of this dataset:
survived: Tells whether a particular passenger survived or not; 0 for not survived, 1 for survived.
pclass: Ticket class; 1 for 1st class, 2 for 2nd class and 3 for 3rd class.
sex: The gender of the passenger.
age: Age in years.
sibsp: Number of siblings or spouses aboard the Titanic.
parch: Number of parents or children of the passenger aboard.
fare: Passenger fare.
embarked: The port of embarkation; C for Cherbourg, Q for Queenstown and S for Southampton.
Problem Statement
Having seen what the data is all about, let's also understand the problem statement. The task is to build a predictive model that predicts whether a passenger with the given parameters will survive or not. Looking closely at the problem, we can see that it's a binary classification problem (0/1), which we will solve using random forests.
Importing the data
Let us first set our working directory and import our dataset.
data <- read.csv("C:/Users/jyoti/Downloads/randomForests/train.csv")
Viewing and summarizing the data
Here, data is a dataframe having all the variables and data of those variables. The dataframe has 891 observations of 12 variables. The next step is to view the data inside the dataframe.
View(data)
Now starts the first main step, "Data Preparation". To check whether there is any missing data, and to learn about quantities such as the mean and standard deviation of each variable, we use the summary() function.
summary(data)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
Understanding distributions for Missing value imputation
As can be seen, there are 177 missing values in the Age variable, so we need to perform missing value imputation. But before doing that, we need to check what the age distribution looks like, so that we know which imputation method to choose and apply.
hist(data$Age)
Since the distribution looks somewhat normal, we can use mean value imputation in this case: that is, we can replace the missing values with the mean of the age. This doesn't shift the mean, and the distribution of age remains roughly the same.
data$Age[is.na(data$Age)] = 29.07
summary(data)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:22.00
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :29.07
## Abelson, Mr. Samuel : 1 Mean :29.57
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:35.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
As can be seen above, Age doesn't have any missing values now. Let's see what the data looks like.
head(data)
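Incidentally, the imputation value need not be hard-coded; it can be computed from the data itself. A small sketch of the same step (by this point the NAs have already been filled, so this mirrors the imputation performed above rather than repeating it):

```r
# Mean of Age computed over the non-missing values only
ageMean <- mean(data$Age, na.rm = TRUE)
# Replace every NA in Age with that mean
data$Age[is.na(data$Age)] <- ageMean
```

Using na.rm = TRUE ensures the missing entries are ignored when the mean is computed.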
Concept of dummy variables
Now, let us understand the concept of dummy variables. Suppose a variable "A" has n classes. This variable A can be replaced by n-1 indicator variables. If A has classes i, j, k, …, then A_i = 1 in the rows where i appears in A's column and 0 in the rest, and similarly for j, k, and so on. The last class is taken care of by the intercept. So, let's introduce dummy variables for the Sex and Embarked columns, since they hold categorical data.
data$female = ifelse(data$Sex=="female", 1, 0)
data$embarked_c = ifelse(data$Embarked=="C", 1, 0)
data$embarked_s = ifelse(data$Embarked=="S", 1, 0)
head(data)
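An equivalent way to generate the n-1 dummy columns automatically is base R's model.matrix(), which drops one level per factor into the intercept. A sketch, not used in the rest of this tutorial:

```r
# Expand the categorical columns into n-1 dummy columns each;
# the dropped level of each factor is absorbed by the intercept
dummies <- model.matrix(~ Sex + Embarked, data = data)
head(dummies)
```

This is handy when a factor has many levels and writing ifelse() calls by hand becomes tedious.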
Subset the data
Now, if you have a look at the dataframe, it contains 15 variables instead of 12. The next step is to remove the variables we no longer need for model building: Name, Sex (already accounted for by the female variable), Ticket, Cabin and Embarked, i.e. column numbers 4, 5, 9, 11 and 12.
PassengerData = data[-c(4, 5, 9, 11, 12)]
head(PassengerData)
Univariate analysis
Let's now do univariate analysis of the numerical variables, Age and Fare.
bx = boxplot(PassengerData$Age)
Thus, there are outliers in the Age variable, and we need to handle them.
bx$stats
## [,1]
## [1,] 3.00
## [2,] 22.00
## [3,] 29.07
## [4,] 35.00
## [5,] 54.00
quantile(PassengerData$Age, seq(0, 1, 0.02))
## 0% 2% 4% 6% 8% 10% 12% 14% 16% 18% 20% 22%
## 0.42 2.00 4.00 8.40 14.00 16.00 17.00 18.00 19.00 19.00 20.00 21.00
## 24% 26% 28% 30% 32% 34% 36% 38% 40% 42% 44% 46%
## 22.00 23.00 24.00 24.00 25.00 26.00 27.00 28.00 28.00 29.00 29.07 29.07
## 48% 50% 52% 54% 56% 58% 60% 62% 64% 66% 68% 70%
## 29.07 29.07 29.07 29.07 29.07 29.07 29.07 29.07 30.00 30.70 32.00 32.50
## 72% 74% 76% 78% 80% 82% 84% 86% 88% 90% 92% 94%
## 34.00 35.00 36.00 36.00 38.00 40.00 41.00 43.00 45.00 47.00 50.00 52.00
## 96% 98% 100%
## 56.00 62.00 80.00
Outlier Handling
We can cap the values above the 96th percentile at the 96th-percentile value, and the values below the 4th percentile at the 4th-percentile value, so that accuracy improves while the data loss remains insignificant.
PassengerData$Age = ifelse(PassengerData$Age>=56, 56, PassengerData$Age)
PassengerData$Age = ifelse(PassengerData$Age<=3, 3, PassengerData$Age)
boxplot(PassengerData$Age)
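The two ifelse() calls above can be generalised into a small helper that caps any numeric vector at chosen percentiles. A sketch (capOutliers is a hypothetical helper name; the default cut-offs match the 4th and 96th percentiles used above):

```r
# Cap a numeric vector at the given lower/upper percentiles
capOutliers <- function(x, lower = 0.04, upper = 0.96) {
  bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Equivalent to the manual capping of Age performed above:
# PassengerData$Age <- capOutliers(PassengerData$Age)
```

The same helper can then be reused for Fare, or for any other numeric variable with outliers.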
After outlier handling, the boxplot comes out clean. Let us now do the same analysis for the Fare variable.
bx = boxplot(PassengerData$Fare)
bx$stats
## [,1]
## [1,] 0.0000
## [2,] 7.9104
## [3,] 14.4542
## [4,] 31.0000
## [5,] 65.0000
Thus, there is a large number of outliers on the upper end.
quantile(PassengerData$Fare, seq(0, 1, 0.02))
## 0% 2% 4% 6% 8% 10% 12%
## 0.00000 6.39750 7.05252 7.22500 7.25000 7.55000 7.75000
## 14% 16% 18% 20% 22% 24% 26%
## 7.75000 7.77500 7.79580 7.85420 7.89580 7.89580 7.92500
## 28% 30% 32% 34% 36% 38% 40%
## 8.05000 8.05000 8.10000 8.66250 9.50000 10.47000 10.50000
## 42% 44% 46% 48% 50% 52% 54%
## 12.22000 13.00000 13.00000 13.08334 14.45420 15.24580 15.85000
## 56% 58% 60% 62% 64% 66% 68%
## 17.88000 20.22000 21.67920 24.15000 26.00000 26.00000 26.30750
## 70% 72% 74% 76% 78% 80% 82%
## 27.00000 28.94250 30.32832 31.38750 35.50000 39.68750 49.90084
## 84% 86% 88% 90% 92% 94% 96%
## 53.10000 57.39168 69.55000 77.95830 82.17080 93.50000 133.99000
## 98% 100%
## 211.33750 512.32920
As can be seen above, the values start jumping sharply above the 96th percentile.
PassengerData$Fare = ifelse(PassengerData$Fare>=133.99, 133.99, PassengerData$Fare)
boxplot(PassengerData$Fare)
Bivariate Analysis
Let us now start our bivariate analysis.
library(car)
## Loading required package: carData
scatterplot(PassengerData$Age, PassengerData$Survived)
It is worth noting that children and elderly passengers were saved first during the Titanic disaster.
scatterplot(PassengerData$Fare, PassengerData$Survived)
Modeling random forests using R
Now, let's load the libraries needed for our random forests model.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(e1071)
The next step is to train the model, with do.trace = T so that the out-of-bag (OOB) error is printed as trees are added.
modelrf <- randomForest(as.factor(PassengerData$Survived)~., data = PassengerData, do.trace = T)
## ntree OOB 1 2
## 1: 26.48% 19.50% 38.02%
## 2: 23.88% 16.41% 35.75%
## 3: 22.86% 14.29% 36.29%
## 4: 22.65% 14.10% 36.05%
## 5: 21.84% 13.82% 34.39%
## 6: 22.28% 14.99% 33.54%
## 7: 22.72% 15.68% 33.84%
## 8: 21.31% 12.59% 35.12%
## 9: 21.26% 13.94% 32.94%
## 10: 20.75% 13.63% 32.15%
## 11: 20.79% 13.21% 32.94%
## 12: 20.88% 13.92% 32.06%
## 13: 20.02% 12.77% 31.67%
## 14: 20.22% 12.02% 33.43%
## 15: 20.00% 12.20% 32.55%
## 16: 19.87% 12.39% 31.87%
## 17: 19.42% 12.02% 31.29%
## 18: 18.97% 11.84% 30.41%
## 19: 19.53% 12.93% 30.12%
## 20: 18.97% 12.02% 30.12%
## 21: 18.97% 12.20% 29.82%
## 22: 19.75% 12.93% 30.70%
## 23: 19.53% 12.93% 30.12%
## 24: 18.97% 12.20% 29.82%
## 25: 19.75% 12.93% 30.70%
## 26: 19.19% 12.75% 29.53%
## 27: 19.75% 13.11% 30.41%
## 28: 18.74% 12.20% 29.24%
## 29: 19.19% 12.93% 29.24%
## 30: 19.08% 12.75% 29.24%
## 31: 19.53% 12.57% 30.70%
## 32: 19.42% 12.39% 30.70%
## 33: 19.08% 12.02% 30.41%
## 34: 19.19% 11.84% 30.99%
## 35: 19.53% 12.02% 31.58%
## 36: 19.08% 11.48% 31.29%
## 37: 19.42% 11.84% 31.58%
## 38: 19.08% 11.11% 31.87%
## 39: 19.42% 11.66% 31.87%
## 40: 18.86% 11.11% 31.29%
## 41: 19.42% 11.29% 32.46%
## 42: 19.19% 10.93% 32.46%
## 43: 19.19% 11.11% 32.16%
## 44: 19.08% 11.11% 31.87%
## 45: 18.41% 11.11% 30.12%
## 46: 17.85% 10.20% 30.12%
## 47: 18.29% 10.75% 30.41%
## 48: 18.41% 11.11% 30.12%
## 49: 18.07% 10.75% 29.82%
## 50: 17.96% 10.75% 29.53%
## 51: 17.40% 10.56% 28.36%
## 52: 17.85% 10.75% 29.24%
## 53: 18.07% 10.93% 29.53%
## 54: 18.07% 10.93% 29.53%
## 55: 18.18% 10.93% 29.82%
## 56: 18.63% 11.48% 30.12%
## 57: 18.18% 11.11% 29.53%
## 58: 18.07% 11.29% 28.95%
## 59: 18.18% 11.11% 29.53%
## 60: 18.52% 10.93% 30.70%
## 61: 18.52% 11.11% 30.41%
## 62: 17.85% 10.38% 29.82%
## 63: 18.07% 10.75% 29.82%
## 64: 18.07% 10.56% 30.12%
## 65: 17.73% 10.56% 29.24%
## 66: 18.07% 10.75% 29.82%
## 67: 17.73% 10.56% 29.24%
## 68: 18.07% 10.56% 30.12%
## 69: 18.29% 10.75% 30.41%
## 70: 17.73% 10.56% 29.24%
## 71: 18.07% 10.38% 30.41%
## 72: 18.29% 10.75% 30.41%
## 73: 17.85% 10.93% 28.95%
## 74: 18.52% 11.11% 30.41%
## 75: 18.41% 10.75% 30.70%
## 76: 18.07% 10.93% 29.53%
## 77: 18.63% 11.29% 30.41%
## 78: 18.63% 10.93% 30.99%
## 79: 18.52% 10.93% 30.70%
## 80: 18.52% 11.29% 30.12%
## 81: 18.63% 11.48% 30.12%
## 82: 18.07% 10.75% 29.82%
## 83: 18.41% 10.93% 30.41%
## 84: 18.07% 10.93% 29.53%
## 85: 18.63% 11.29% 30.41%
## 86: 18.86% 11.29% 30.99%
## 87: 19.08% 11.11% 31.87%
## 88: 18.86% 11.11% 31.29%
## 89: 18.52% 10.93% 30.70%
## 90: 18.63% 11.29% 30.41%
## 91: 18.86% 11.48% 30.70%
## 92: 18.74% 11.29% 30.70%
## 93: 18.97% 11.29% 31.29%
## 94: 18.29% 10.56% 30.70%
## 95: 18.18% 11.11% 29.53%
## 96: 18.29% 10.93% 30.12%
## 97: 18.41% 11.11% 30.12%
## 98: 18.52% 10.93% 30.70%
## 99: 18.29% 10.93% 30.12%
## 100: 18.52% 10.93% 30.70%
## 101: 18.18% 10.93% 29.82%
## 102: 18.18% 10.75% 30.12%
## 103: 18.18% 10.75% 30.12%
## 104: 17.85% 10.56% 29.53%
## 105: 17.96% 10.38% 30.12%
## 106: 18.29% 10.93% 30.12%
## 107: 18.29% 11.11% 29.82%
## 108: 18.07% 10.93% 29.53%
## 109: 17.96% 10.56% 29.82%
## 110: 17.96% 10.75% 29.53%
## 111: 18.18% 10.75% 30.12%
## 112: 18.41% 10.93% 30.41%
## 113: 18.07% 10.38% 30.41%
## 114: 18.29% 10.56% 30.70%
## 115: 18.52% 10.75% 30.99%
## 116: 18.41% 10.56% 30.99%
## 117: 18.74% 10.56% 31.87%
## 118: 18.29% 10.56% 30.70%
## 119: 18.63% 10.56% 31.58%
## 120: 18.29% 10.02% 31.58%
## 121: 18.18% 10.20% 30.99%
## 122: 18.41% 10.38% 31.29%
## 123: 18.41% 10.20% 31.58%
## 124: 18.29% 10.20% 31.29%
## 125: 18.07% 10.02% 30.99%
## 126: 18.18% 10.02% 31.29%
## 127: 18.07% 10.02% 30.99%
## 128: 18.18% 9.84% 31.58%
## 129: 17.96% 9.84% 30.99%
## 130: 18.18% 10.02% 31.29%
## 131: 17.96% 9.84% 30.99%
## 132: 17.96% 9.84% 30.99%
## 133: 17.96% 9.84% 30.99%
## 134: 17.73% 9.65% 30.70%
## 135: 17.73% 9.65% 30.70%
## 136: 17.73% 9.84% 30.41%
## 137: 17.73% 9.84% 30.41%
## 138: 17.51% 9.84% 29.82%
## 139: 17.73% 9.84% 30.41%
## 140: 17.28% 9.65% 29.53%
## 141: 17.85% 9.84% 30.70%
## 142: 18.07% 10.20% 30.70%
## 143: 17.85% 9.84% 30.70%
## 144: 17.73% 9.84% 30.41%
## 145: 18.41% 10.38% 31.29%
## 146: 18.07% 10.20% 30.70%
## 147: 17.96% 10.20% 30.41%
## 148: 17.73% 10.20% 29.82%
## 149: 18.07% 10.02% 30.99%
## 150: 17.73% 9.84% 30.41%
## 151: 18.07% 10.02% 30.99%
## 152: 18.18% 9.84% 31.58%
## 153: 18.18% 9.84% 31.58%
## 154: 17.85% 9.65% 30.99%
## 155: 17.85% 9.84% 30.70%
## 156: 17.96% 9.84% 30.99%
## 157: 18.07% 9.84% 31.29%
## 158: 17.85% 9.84% 30.70%
## 159: 17.96% 9.84% 30.99%
## 160: 18.07% 9.84% 31.29%
## 161: 18.07% 10.02% 30.99%
## 162: 17.96% 9.84% 30.99%
## 163: 18.07% 10.02% 30.99%
## 164: 18.07% 10.02% 30.99%
## 165: 18.18% 10.02% 31.29%
## 166: 18.18% 10.02% 31.29%
## 167: 18.07% 9.84% 31.29%
## 168: 18.07% 10.02% 30.99%
## 169: 18.07% 10.02% 30.99%
## 170: 17.96% 9.65% 31.29%
## 171: 17.85% 9.65% 30.99%
## 172: 17.85% 9.65% 30.99%
## 173: 17.85% 9.84% 30.70%
## 174: 17.73% 9.65% 30.70%
## 175: 17.85% 9.84% 30.70%
## 176: 17.62% 9.65% 30.41%
## 177: 17.85% 9.84% 30.70%
## 178: 17.85% 9.84% 30.70%
## 179: 17.85% 10.02% 30.41%
## 180: 17.73% 9.84% 30.41%
## 181: 17.62% 9.84% 30.12%
## 182: 17.73% 9.84% 30.41%
## 183: 17.62% 9.84% 30.12%
## 184: 17.62% 10.02% 29.82%
## 185: 17.85% 10.02% 30.41%
## 186: 17.73% 9.84% 30.41%
## 187: 17.73% 9.84% 30.41%
## 188: 17.62% 10.02% 29.82%
## 189: 17.62% 10.02% 29.82%
## 190: 17.62% 9.65% 30.41%
## 191: 17.62% 9.84% 30.12%
## 192: 17.51% 9.84% 29.82%
## 193: 17.62% 9.84% 30.12%
## 194: 17.40% 9.65% 29.82%
## 195: 17.51% 9.65% 30.12%
## 196: 17.73% 10.20% 29.82%
## 197: 17.62% 10.02% 29.82%
## 198: 17.62% 10.02% 29.82%
## 199: 17.73% 10.02% 30.12%
## 200: 17.62% 9.84% 30.12%
## 201: 17.62% 9.84% 30.12%
## 202: 17.62% 9.84% 30.12%
## 203: 17.62% 10.02% 29.82%
## 204: 17.51% 10.02% 29.53%
## 205: 17.62% 10.02% 29.82%
## 206: 17.51% 9.84% 29.82%
## 207: 17.73% 10.02% 30.12%
## 208: 17.62% 9.84% 30.12%
## 209: 17.28% 9.84% 29.24%
## 210: 17.62% 9.84% 30.12%
## 211: 17.85% 10.02% 30.41%
## 212: 17.73% 10.02% 30.12%
## 213: 17.85% 10.20% 30.12%
## 214: 17.73% 10.02% 30.12%
## 215: 17.51% 10.02% 29.53%
## 216: 17.73% 10.20% 29.82%
## 217: 17.85% 10.20% 30.12%
## 218: 17.51% 10.20% 29.24%
## 219: 17.62% 10.20% 29.53%
## 220: 17.62% 10.20% 29.53%
## 221: 17.85% 10.38% 29.82%
## 222: 17.62% 10.20% 29.53%
## 223: 17.62% 10.20% 29.53%
## 224: 17.51% 10.20% 29.24%
## 225: 17.28% 10.02% 28.95%
## 226: 17.40% 10.02% 29.24%
## 227: 17.17% 10.02% 28.65%
## 228: 17.62% 10.20% 29.53%
## 229: 17.62% 10.02% 29.82%
## 230: 17.62% 10.02% 29.82%
## 231: 17.51% 10.02% 29.53%
## 232: 17.40% 10.02% 29.24%
## 233: 17.28% 10.02% 28.95%
## 234: 17.28% 10.02% 28.95%
## 235: 17.40% 10.02% 29.24%
## 236: 17.40% 10.02% 29.24%
## 237: 17.62% 10.02% 29.82%
## 238: 17.96% 10.02% 30.70%
## 239: 17.85% 10.02% 30.41%
## 240: 17.96% 10.02% 30.70%
## 241: 17.96% 9.84% 30.99%
## 242: 17.85% 9.84% 30.70%
## 243: 18.18% 10.02% 31.29%
## 244: 18.07% 9.84% 31.29%
## 245: 17.96% 10.02% 30.70%
## 246: 17.85% 9.84% 30.70%
## 247: 17.96% 9.84% 30.99%
## 248: 18.07% 9.84% 31.29%
## 249: 17.96% 10.02% 30.70%
## 250: 17.96% 9.84% 30.99%
## 251: 17.96% 9.84% 30.99%
## 252: 17.85% 9.84% 30.70%
## 253: 17.85% 9.84% 30.70%
## 254: 17.62% 9.84% 30.12%
## 255: 17.73% 9.84% 30.41%
## 256: 17.73% 10.02% 30.12%
## 257: 17.73% 10.02% 30.12%
## 258: 17.96% 10.38% 30.12%
## 259: 17.73% 10.02% 30.12%
## 260: 17.85% 10.20% 30.12%
## 261: 18.07% 10.20% 30.70%
## 262: 17.96% 10.38% 30.12%
## 263: 17.96% 10.38% 30.12%
## 264: 17.96% 10.38% 30.12%
## 265: 17.96% 10.38% 30.12%
## 266: 17.96% 10.38% 30.12%
## 267: 17.96% 10.38% 30.12%
## 268: 17.96% 10.38% 30.12%
## 269: 17.96% 10.38% 30.12%
## 270: 17.85% 10.38% 29.82%
## 271: 17.85% 10.38% 29.82%
## 272: 17.96% 10.38% 30.12%
## 273: 17.96% 10.38% 30.12%
## 274: 17.85% 10.20% 30.12%
## 275: 17.96% 10.38% 30.12%
## 276: 18.18% 10.38% 30.70%
## 277: 18.07% 10.38% 30.41%
## 278: 17.96% 10.38% 30.12%
## 279: 18.18% 10.38% 30.70%
## 280: 17.85% 10.38% 29.82%
## 281: 17.62% 10.20% 29.53%
## 282: 18.18% 10.38% 30.70%
## 283: 17.96% 10.38% 30.12%
## 284: 17.96% 10.38% 30.12%
## 285: 17.85% 10.38% 29.82%
## 286: 17.85% 10.38% 29.82%
## 287: 17.85% 10.38% 29.82%
## 288: 17.73% 10.20% 29.82%
## 289: 17.73% 10.20% 29.82%
## 290: 17.85% 10.20% 30.12%
## 291: 17.73% 10.20% 29.82%
## 292: 17.51% 9.84% 29.82%
## 293: 17.73% 10.20% 29.82%
## 294: 17.85% 10.20% 30.12%
## 295: 17.85% 10.38% 29.82%
## 296: 17.51% 10.02% 29.53%
## 297: 17.73% 10.02% 30.12%
## 298: 17.62% 10.02% 29.82%
## 299: 17.62% 10.02% 29.82%
## 300: 17.51% 10.20% 29.24%
## 301: 17.51% 10.20% 29.24%
## 302: 17.62% 10.20% 29.53%
## 303: 17.62% 10.20% 29.53%
## 304: 17.62% 10.38% 29.24%
## 305: 17.62% 10.38% 29.24%
## 306: 17.62% 10.38% 29.24%
## 307: 17.73% 10.56% 29.24%
## 308: 17.62% 10.38% 29.24%
## 309: 17.73% 10.56% 29.24%
## 310: 17.73% 10.56% 29.24%
## 311: 17.85% 10.56% 29.53%
## 312: 17.85% 10.38% 29.82%
## 313: 17.96% 10.56% 29.82%
## 314: 17.73% 10.38% 29.53%
## 315: 17.96% 10.56% 29.82%
## 316: 17.85% 10.38% 29.82%
## 317: 17.73% 10.38% 29.53%
## 318: 17.85% 10.56% 29.53%
## 319: 17.96% 10.56% 29.82%
## 320: 17.73% 10.38% 29.53%
## 321: 17.73% 10.38% 29.53%
## 322: 17.85% 10.56% 29.53%
## 323: 17.85% 10.56% 29.53%
## 324: 17.62% 10.38% 29.24%
## 325: 17.62% 10.38% 29.24%
## 326: 17.73% 10.38% 29.53%
## 327: 17.51% 10.38% 28.95%
## 328: 17.73% 10.38% 29.53%
## 329: 17.62% 10.38% 29.24%
## 330: 17.73% 10.38% 29.53%
## 331: 17.85% 10.38% 29.82%
## 332: 17.62% 10.38% 29.24%
## 333: 17.73% 10.38% 29.53%
## 334: 17.73% 10.38% 29.53%
## 335: 17.96% 10.56% 29.82%
## 336: 17.73% 10.38% 29.53%
## 337: 17.51% 10.02% 29.53%
## 338: 17.40% 10.02% 29.24%
## 339: 17.40% 10.02% 29.24%
## 340: 17.40% 10.02% 29.24%
## 341: 17.51% 10.20% 29.24%
## 342: 17.51% 10.20% 29.24%
## 343: 17.51% 10.20% 29.24%
## 344: 17.51% 10.02% 29.53%
## 345: 17.51% 10.02% 29.53%
## 346: 17.73% 10.20% 29.82%
## 347: 17.51% 10.02% 29.53%
## 348: 17.62% 10.20% 29.53%
## 349: 17.51% 10.02% 29.53%
## 350: 17.51% 10.02% 29.53%
## 351: 17.62% 10.20% 29.53%
## 352: 17.62% 10.20% 29.53%
## 353: 17.73% 10.20% 29.82%
## 354: 17.85% 10.38% 29.82%
## 355: 17.85% 10.38% 29.82%
## 356: 17.62% 10.20% 29.53%
## 357: 17.73% 10.20% 29.82%
## 358: 17.73% 10.20% 29.82%
## 359: 17.73% 10.20% 29.82%
## 360: 17.85% 10.38% 29.82%
## 361: 17.73% 10.38% 29.53%
## 362: 17.85% 10.38% 29.82%
## 363: 17.73% 10.38% 29.53%
## 364: 17.73% 10.38% 29.53%
## 365: 17.51% 10.20% 29.24%
## 366: 17.62% 10.38% 29.24%
## 367: 17.51% 10.20% 29.24%
## 368: 17.51% 10.20% 29.24%
## 369: 17.51% 10.20% 29.24%
## 370: 17.62% 10.20% 29.53%
## 371: 17.51% 10.20% 29.24%
## 372: 17.62% 10.38% 29.24%
## 373: 17.73% 10.20% 29.82%
## 374: 17.62% 10.20% 29.53%
## 375: 17.62% 10.20% 29.53%
## 376: 17.62% 10.20% 29.53%
## 377: 17.62% 10.20% 29.53%
## 378: 17.62% 10.20% 29.53%
## 379: 17.73% 10.20% 29.82%
## 380: 17.73% 10.20% 29.82%
## 381: 17.73% 10.20% 29.82%
## 382: 17.73% 10.20% 29.82%
## 383: 17.73% 10.20% 29.82%
## 384: 17.73% 10.20% 29.82%
## 385: 17.73% 10.20% 29.82%
## 386: 17.73% 10.20% 29.82%
## 387: 17.73% 10.20% 29.82%
## 388: 17.73% 10.20% 29.82%
## 389: 17.73% 10.20% 29.82%
## 390: 17.73% 10.20% 29.82%
## 391: 17.73% 10.20% 29.82%
## 392: 17.73% 10.20% 29.82%
## 393: 17.73% 10.20% 29.82%
## 394: 17.73% 10.20% 29.82%
## 395: 17.73% 10.20% 29.82%
## 396: 17.73% 10.20% 29.82%
## 397: 17.73% 10.20% 29.82%
## 398: 17.73% 10.20% 29.82%
## 399: 17.73% 10.20% 29.82%
## 400: 17.73% 10.20% 29.82%
## 401: 17.73% 10.20% 29.82%
## 402: 17.73% 10.20% 29.82%
## 403: 17.73% 10.20% 29.82%
## 404: 17.73% 10.20% 29.82%
## 405: 17.73% 10.20% 29.82%
## 406: 17.62% 10.02% 29.82%
## 407: 17.73% 10.20% 29.82%
## 408: 17.73% 10.20% 29.82%
## 409: 17.73% 10.20% 29.82%
## 410: 17.62% 10.02% 29.82%
## 411: 17.62% 10.02% 29.82%
## 412: 17.62% 10.02% 29.82%
## 413: 17.62% 10.02% 29.82%
## 414: 17.62% 10.02% 29.82%
## 415: 17.62% 10.02% 29.82%
## 416: 17.62% 10.02% 29.82%
## 417: 17.62% 10.02% 29.82%
## 418: 17.62% 10.02% 29.82%
## 419: 17.62% 10.02% 29.82%
## 420: 17.62% 10.02% 29.82%
## 421: 17.62% 10.02% 29.82%
## 422: 17.62% 10.02% 29.82%
## 423: 17.62% 10.02% 29.82%
## 424: 17.62% 10.02% 29.82%
## 425: 17.62% 10.02% 29.82%
## 426: 17.62% 10.02% 29.82%
## 427: 17.62% 10.02% 29.82%
## 428: 17.62% 10.02% 29.82%
## 429: 17.62% 10.02% 29.82%
## 430: 17.62% 10.02% 29.82%
## 431: 17.62% 10.02% 29.82%
## 432: 17.73% 10.02% 30.12%
## 433: 17.73% 10.02% 30.12%
## 434: 17.73% 10.02% 30.12%
## 435: 17.62% 10.02% 29.82%
## 436: 17.62% 9.84% 30.12%
## 437: 17.62% 10.02% 29.82%
## 438: 17.51% 9.84% 29.82%
## 439: 17.51% 9.84% 29.82%
## 440: 17.51% 9.84% 29.82%
## 441: 17.51% 9.84% 29.82%
## 442: 17.51% 9.84% 29.82%
## 443: 17.51% 9.84% 29.82%
## 444: 17.51% 9.84% 29.82%
## 445: 17.51% 9.84% 29.82%
## 446: 17.51% 9.84% 29.82%
## 447: 17.51% 9.84% 29.82%
## 448: 17.62% 10.02% 29.82%
## 449: 17.51% 9.84% 29.82%
## 450: 17.51% 9.84% 29.82%
## 451: 17.51% 9.84% 29.82%
## 452: 17.51% 9.84% 29.82%
## 453: 17.51% 9.84% 29.82%
## 454: 17.62% 10.02% 29.82%
## 455: 17.62% 10.02% 29.82%
## 456: 17.51% 9.84% 29.82%
## 457: 17.51% 10.02% 29.53%
## 458: 17.62% 10.02% 29.82%
## 459: 17.62% 10.02% 29.82%
## 460: 17.73% 10.02% 30.12%
## 461: 17.73% 10.02% 30.12%
## 462: 17.73% 10.02% 30.12%
## 463: 17.62% 9.84% 30.12%
## 464: 17.62% 9.84% 30.12%
## 465: 17.51% 9.84% 29.82%
## 466: 17.73% 10.02% 30.12%
## 467: 17.73% 10.02% 30.12%
## 468: 17.51% 9.84% 29.82%
## 469: 17.62% 9.84% 30.12%
## 470: 17.62% 9.84% 30.12%
## 471: 17.73% 9.84% 30.41%
## 472: 17.73% 9.84% 30.41%
## 473: 17.73% 9.84% 30.41%
## 474: 17.85% 10.02% 30.41%
## 475: 17.73% 9.84% 30.41%
## 476: 17.73% 9.84% 30.41%
## 477: 17.85% 10.02% 30.41%
## 478: 17.85% 10.02% 30.41%
## 479: 17.73% 10.02% 30.12%
## 480: 17.85% 10.02% 30.41%
## 481: 17.73% 10.02% 30.12%
## 482: 17.73% 10.02% 30.12%
## 483: 17.73% 10.02% 30.12%
## 484: 17.73% 10.02% 30.12%
## 485: 17.73% 10.20% 29.82%
## 486: 17.73% 10.20% 29.82%
## 487: 17.62% 10.02% 29.82%
## 488: 17.73% 10.20% 29.82%
## 489: 17.62% 10.02% 29.82%
## 490: 17.73% 10.02% 30.12%
## 491: 17.73% 10.20% 29.82%
## 492: 17.73% 10.20% 29.82%
## 493: 17.73% 10.20% 29.82%
## 494: 17.73% 10.02% 30.12%
## 495: 17.73% 10.02% 30.12%
## 496: 17.73% 10.02% 30.12%
## 497: 17.73% 10.20% 29.82%
## 498: 17.85% 10.20% 30.12%
## 499: 17.73% 10.02% 30.12%
## 500: 17.73% 10.02% 30.12%
Let’s now call our model to see what’s in it.
modelrf
##
## Call:
## randomForest(formula = as.factor(PassengerData$Survived) ~ ., data = PassengerData, do.trace = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.73%
## Confusion matrix:
## 0 1 class.error
## 0 494 55 0.1001821
## 1 103 239 0.3011696
As you can see, 500 trees have been used to build this forest, with an OOB error rate of 17.73%. Let's find out the importance of each of the variables.
importance(modelrf)
## MeanDecreaseGini
## PassengerId 62.139964
## Pclass 35.889853
## Age 58.417328
## SibSp 16.154976
## Parch 12.207239
## Fare 71.423022
## female 106.250916
## embarked_c 6.026943
## embarked_s 7.076869
varImpPlot(modelrf)
Using model to make predictions
As can be seen, the female variable is the most important one. Let's now make predictions on the train dataset.
predictTr <- predict(modelrf, PassengerData)
table(predictTr, PassengerData$Survived)
##
## predictTr 0 1
## 0 546 29
## 1 3 313
Thus, the accuracy on the train dataset is (546+313)/(546+29+3+313) = 859/891 ≈ 0.9641. Note that this in-sample accuracy is optimistic; the OOB estimate (about 82%) is a better guide to performance on unseen data.
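Rather than computing the accuracy by hand, it can be derived directly from the confusion matrix:

```r
# Accuracy = sum of the diagonal (correct predictions) / total observations
confMat <- table(predictTr, PassengerData$Survived)
accuracy <- sum(diag(confMat)) / sum(confMat)
accuracy  # about 0.964 on the training data
```

diag() picks out the correctly classified counts (true 0s predicted 0, true 1s predicted 1), so this generalises to any square confusion matrix.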
test <- read.csv("C:/Users/jyoti/Downloads/randomForests/test.csv")
summary(test)
## PassengerId Pclass
## Min. : 892.0 Min. :1.000
## 1st Qu.: 996.2 1st Qu.:1.000
## Median :1100.5 Median :3.000
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Name Sex
## Abbott, Master. Eugene Joseph : 1 female:152
## Abelseth, Miss. Karen Marie : 1 male :266
## Abelseth, Mr. Olaus Jorgensen : 1
## Abrahamsson, Mr. Abraham August Johannes : 1
## Abrahim, Mrs. Joseph (Sophie Halaut Easu): 1
## Aks, Master. Philip Frank : 1
## (Other) :412
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 PC 17608: 5
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 113503 : 4
## Median :27.00 Median :0.0000 Median :0.0000 CA. 2343: 4
## Mean :30.27 Mean :0.4474 Mean :0.3923 16966 : 3
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000 220845 : 3
## Max. :76.00 Max. :8.0000 Max. :9.0000 347077 : 3
## NA's :86 (Other) :396
## Fare Cabin Embarked
## Min. : 0.000 :327 C:102
## 1st Qu.: 7.896 B57 B59 B63 B66: 3 Q: 46
## Median : 14.454 A34 : 2 S:270
## Mean : 35.627 B45 : 2
## 3rd Qu.: 31.500 C101 : 2
## Max. :512.329 C116 : 2
## NA's :1 (Other) : 80
Since there are missing values in the test dataset as well, we will follow the same series of steps as for the train data.
hist(test$Age)
Let’s replace missing values with mean as the distribution is more or less normal in nature.
test$Age[is.na(test$Age)]=30.27
summary(test)
## PassengerId Pclass
## Min. : 892.0 Min. :1.000
## 1st Qu.: 996.2 1st Qu.:1.000
## Median :1100.5 Median :3.000
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Name Sex
## Abbott, Master. Eugene Joseph : 1 female:152
## Abelseth, Miss. Karen Marie : 1 male :266
## Abelseth, Mr. Olaus Jorgensen : 1
## Abrahamsson, Mr. Abraham August Johannes : 1
## Abrahim, Mrs. Joseph (Sophie Halaut Easu): 1
## Aks, Master. Philip Frank : 1
## (Other) :412
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 PC 17608: 5
## 1st Qu.:23.00 1st Qu.:0.0000 1st Qu.:0.0000 113503 : 4
## Median :30.27 Median :0.0000 Median :0.0000 CA. 2343: 4
## Mean :30.27 Mean :0.4474 Mean :0.3923 16966 : 3
## 3rd Qu.:35.75 3rd Qu.:1.0000 3rd Qu.:0.0000 220845 : 3
## Max. :76.00 Max. :8.0000 Max. :9.0000 347077 : 3
## (Other) :396
## Fare Cabin Embarked
## Min. : 0.000 :327 C:102
## 1st Qu.: 7.896 B57 B59 B63 B66: 3 Q: 46
## Median : 14.454 A34 : 2 S:270
## Mean : 35.627 B45 : 2
## 3rd Qu.: 31.500 C101 : 2
## Max. :512.329 C116 : 2
## NA's :1 (Other) : 80
There is one missing value in Fare too.
hist(test$Fare)
Since the distribution is skewed, let's replace the missing value with the median (14.454) rather than the mean, as the median is less affected by the skew.
test$Fare[is.na(test$Fare)] = 14.454
summary(test)
## PassengerId Pclass
## Min. : 892.0 Min. :1.000
## 1st Qu.: 996.2 1st Qu.:1.000
## Median :1100.5 Median :3.000
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Name Sex
## Abbott, Master. Eugene Joseph : 1 female:152
## Abelseth, Miss. Karen Marie : 1 male :266
## Abelseth, Mr. Olaus Jorgensen : 1
## Abrahamsson, Mr. Abraham August Johannes : 1
## Abrahim, Mrs. Joseph (Sophie Halaut Easu): 1
## Aks, Master. Philip Frank : 1
## (Other) :412
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 PC 17608: 5
## 1st Qu.:23.00 1st Qu.:0.0000 1st Qu.:0.0000 113503 : 4
## Median :30.27 Median :0.0000 Median :0.0000 CA. 2343: 4
## Mean :30.27 Mean :0.4474 Mean :0.3923 16966 : 3
## 3rd Qu.:35.75 3rd Qu.:1.0000 3rd Qu.:0.0000 220845 : 3
## Max. :76.00 Max. :8.0000 Max. :9.0000 347077 : 3
## (Other) :396
## Fare Cabin Embarked
## Min. : 0.000 :327 C:102
## 1st Qu.: 7.896 B57 B59 B63 B66: 3 Q: 46
## Median : 14.454 A34 : 2 S:270
## Mean : 35.577 B45 : 2
## 3rd Qu.: 31.472 C101 : 2
## Max. :512.329 C116 : 2
## (Other) : 80
Let's now create the same dummy variables for the test set.
test$female = ifelse(test$Sex=="female", 1, 0)
test$embarked_c = ifelse(test$Embarked=="C", 1, 0)
test$embarked_s = ifelse(test$Embarked=="S", 1, 0)
head(test)
Let's remove the Name, Sex, Ticket, Cabin and Embarked variables, as we did for the training set.
newtest = data.frame(test)
newtest = newtest[-c(3, 4, 8, 10, 11)]
head(newtest)
Now, we will do the predictions.
newtest$predicted<-predict(modelrf, newtest)
head(newtest$predicted)
## [1] 0 0 0 0 0 0
## Levels: 0 1
Getting ready for submissions
Let’s save the passengerId variable and predictions in a new dataframe.
submission<- data.frame(matrix(nrow=nrow(test)))
submission$PassengerId <- newtest$PassengerId
submission$Survived <- newtest$predicted
submission <- submission[-c(1)]
Let’s now save the dataframe as a csv file.
write.csv(submission, 'submission.csv', row.names = FALSE)
Thanks Jyoti for sharing.
How can we proceed with data preprocessing when there are 1,000 variables? What steps should be followed to manage outliers?
Welcome, Ash! The best way to manage outliers is to look at the quantiles and the box plot of a variable to see whether there are outliers, and then replace them with the value corresponding to, say, the 96th percentile. You can also opt for regularization, as it minimizes the effect of outliers. I have made two posts on L2 and L1 regularization; please go through them. You will then be able to handle outliers using their deep learning implementations.
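To make the advice above concrete, with hundreds of numeric variables the percentile capping can be applied in a loop rather than column by column. A sketch, assuming df is your dataframe (a hypothetical name) and that capping every numeric column at the 4th/96th percentiles is appropriate for your data:

```r
# Cap every numeric column of df at its 4th and 96th percentiles
numericCols <- sapply(df, is.numeric)
df[numericCols] <- lapply(df[numericCols], function(x) {
  bounds <- quantile(x, c(0.04, 0.96), na.rm = TRUE)
  pmin(pmax(x, bounds[1]), bounds[2])
})
```

For 1,000 variables you would still want to inspect at least the most important columns individually, since a single pair of cut-offs will not suit every distribution.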