DS Concepts DS Languages

Data Cleaning

Data cleaning – Introduction

While doing data data analysis, before analyzing the data for any kind of resultant, we most perform the most essential step, that is, Data Cleaning. Because when we get the data it may not be ready for complete analysis, like there might me missing values which will make our judgement wrong, outliers which may affect the Mean, Median; variable may spread across the values of a column (need to convert that column into multiple variables. i.e., columns) and many more. So before performing any kind of analysis for getting results by using any means like Linear Regression, we must do data cleaning for getting the accurate results.

Working on dataset for data cleaning

Let’s use one of the data-sets being provided in RStudio, that is, “SURVEY” for data cleaning purposes. It consists of variables like “Age”, “Pulse rate”, “Gender(Sex)”, etc.

library(MASS)
View(survey)
head(survey, 10)
##       Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height
## 1  Female   18.5   18.0 Right  R on L    92    Left Some Never 173.00
## 2    Male   19.5   20.5  Left  R on L   104    Left None Regul 177.80
## 3    Male   18.0   13.3 Right  L on R    87 Neither None Occas     NA
## 4    Male   18.8   18.9 Right  R on L    NA Neither None Never 160.00
## 5    Male   20.0   20.0 Right Neither    35   Right Some Never 165.00
## 6  Female   18.0   17.7 Right  L on R    64   Right Some Never 172.72
## 7    Male   17.7   17.7 Right  L on R    83   Right Freq Never 182.88
## 8  Female   17.0   17.3 Right  R on L    74   Right Freq Never 157.00
## 9    Male   20.0   19.5 Right  R on L    72   Right Some Never 175.00
## 10   Male   18.5   18.5 Right  R on L    90   Right Some Never 167.00
##         M.I    Age
## 1    Metric 18.250
## 2  Imperial 17.583
## 3       16.917
## 4    Metric 20.333
## 5    Metric 23.667
## 6  Imperial 21.000
## 7  Imperial 18.833
## 8    Metric 35.833
## 9    Metric 19.000
## 10   Metric 22.333
summary(survey)
##      Sex          Wr.Hnd          NW.Hnd        W.Hnd          Fold    
##  Female:118   Min.   :13.00   Min.   :12.50   Left : 18   L on R : 99  
##  Male  :118   1st Qu.:17.50   1st Qu.:17.50   Right:218   Neither: 18  
##  NA's  :  1   Median :18.50   Median :18.50   NA's :  1   R on L :120  
##               Mean   :18.67   Mean   :18.58                            
##               3rd Qu.:19.80   3rd Qu.:19.73                            
##               Max.   :23.20   Max.   :23.50                            
##               NA's   :1       NA's   :1                                
##      Pulse             Clap       Exer       Smoke         Height     
##  Min.   : 35.00   Left   : 39   Freq:115   Heavy: 11   Min.   :150.0  
##  1st Qu.: 66.00   Neither: 50   None: 24   Never:189   1st Qu.:165.0  
##  Median : 72.50   Right  :147   Some: 98   Occas: 19   Median :171.0  
##  Mean   : 74.15   NA's   :  1              Regul: 17   Mean   :172.4  
##  3rd Qu.: 80.00                            NA's :  1   3rd Qu.:180.0  
##  Max.   :104.00                                        Max.   :200.0  
##  NA's   :45                                            NA's   :28     
##        M.I           Age       
##  Imperial: 68   Min.   :16.75  
##  Metric  :141   1st Qu.:17.67  
##  NA's    : 28   Median :18.58  
##                 Mean   :20.37  
##                 3rd Qu.:20.17  
##                 Max.   :73.00  
## 

 

Description of data-set is as follows:

Sex           <- The sex of the student (Factor with levels “Male” and “Female”.)
Wr.Hnd   <- span (distance from tip of thumb to tip of little finger of spread hand) of writing hand, in centimetres.
NW.Hnd  <- span of non-writing hand
W.Hnd     <- writing hand of student (Factor, with levels “Left” and “Right”.)
Fold          <- Fold your arms! Which is on top (Factor, with levels “R on L”, “L on R”, “Neither”.)
Pulse        <- pulse rate of student (beats per minute)
Clap         <- Clap your hands! Which hand is on top (Factor, with levels “Right”, “Left”, “Neither”.)
Exer        <- how often the student exercises (Factor, with levels “Freq” (frequently), “Some”, “None”.)
Smoke     <- how much the student smokes (Factor, levels “Heavy”, “Regul” (regularly), “Occas” (occasionally), “Never”.)
Height     <- height of the student in centimetres
M.I.           <- whether the student expressed height in imperial (feet/inches) or metric (centimetres/metres) units. (Factor, levels “Metric”, “Imperial”.)
Age          <- age of the student in years

By “View()” method above the data set opens separately in RStudio for better view and we can see the data type as well of each variable by hovering onto it.

By “summary()” we got the summary of data like “Min value”, “Max value”, “Mean”, “Frequency”, etc.

The main things that can be interpreted from above summary are:

  1. Categorical Variables: Sex, W.Hnd, Fold, Clap, Exer, Smoke, M.I
  2. Numerical Variables: Wr.Hnd, NW.Hnd, Pulse, Height, Age
  3. Presence of “NA” –> which means that our data-set contains missing or undefined values. While visualising the data using View() we can verify that there are a lot of missing values.
  4. Other factors like –> Frequecny in case of categorical variable and for Numeric variables we get Mean, Median, Min, Max, 1st and 3rd Quartile.

?? Now the first question would have aroused in your mind that if the data is missing then how we can do proper analysis of data. You would be thinking how we can rplace all “NA” with some data. For that we generally use Mean, Median and Mode.

Filling Up Missing Values:

Since we have worked upon the outliers now we can work upon the missing values. For removing the gaps in information we can replace “NA” with either “Mean”, “Median” or “Mode”, depending upon the scenarios.

Scenarios for Raplcaing Missing Values:

  1. If variable is “Continuous” –> Replace NA with any either “Mean”, “Median” or “Mode”.
  2. Otherwise
  1. If data “Normally Distributed” –> Replace NA with “Mean”.
  2. if data “Skewd” –> Replace NA with “Median”.
  1. If “Categorical” –> Replace NA with “Mode”.

** For “Non-continuous Numerical” data we will be using “Histogram” to check distribution of data (Normal or Skewd).

Working on Survey Data-Set:

  1. For variable “Sex” It is a Categorical variable so we will use Mode.
survey$Sex[is.na(survey$Sex)] <- "Female"
summary(survey$Sex)
## Female   Male 
##    119    118
  1. For variable “Wr.Hnd” It is a Non-continuous Numerical Variable so first we will look for distribution of data, by using histogram.
hist(survey$Wr.Hnd, main="Wr.Hnd")

 

hist_wr_hnd

In above histogram we can see that the data is Right-Skewd, so we will replace “NA” with “Median”

survey$Wr.Hnd[is.na(survey$Wr.Hnd)] <- median(survey$Wr.Hnd, na.rm=TRUE)
summary(survey$Wr.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   17.50   18.50   18.67   19.80   23.20
  1. For variable “NW.Hnd” It is a Non-continuous Numerical Variable so first we will look for distribution of data, by using histogram.
hist(survey$NW.Hnd, main="NW.Hnd")

 

hist_nw_hnd

In above histogram we can see that the data is approximately “Right Skewd”, so we will replace “NA” with “Median”

survey$NW.Hnd[is.na(survey$NW.Hnd)] <- median(survey$NW.Hnd, na.rm=TRUE)
summary(survey$NW.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.50   17.50   18.50   18.58   19.70   23.50
  1. For variable “W.Hnd” It is a Categorical variable so we will use Mode.
survey$W.Hnd[is.na(survey$W.Hnd)] <- "Right"
summary(survey$W.Hnd)
##  Left Right 
##    18   219
  1. For variable “Pulse” It is a Non-continuous Numerical Variable so first we will look for distribution of data, by using histogram.
hist(survey$Pulse, main="Pulse")

 

hist_pulse

In above histogram we can see that the data is approximately “Normally Distributed”, so we will replace “NA” with “Mean”

survey$Pulse[is.na(survey$Pulse)] <- mean(survey$Pulse, na.rm=TRUE)
summary(survey$Pulse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   68.00   74.15   74.15   80.00  104.00
  1. For variable “Clap” It is a Categorical variable so we will use Mode.
survey$Clap[is.na(survey$Clap)] <- "Right"
summary(survey$Clap)
##    Left Neither   Right 
##      39      50     148
  1. For variable “Smoke” It is a Categorical variable so we will use Mode.
survey$Smoke[is.na(survey$Smoke)] <- "Never"
summary(survey$Smoke)
## Heavy Never Occas Regul 
##    11   190    19    17
  1. For variable “Height” It is a Non-continuous Numerical Variable so first we will look for distribution of data, by using histogram.
hist(survey$Height, main="Height")

 

hist_height.png

In above histogram we can see that the data is approximately “Right Skewd”, so we will replace “NA” with “median”

survey$Height[is.na(survey$Height)] <- median(survey$Height, na.rm=TRUE)
summary(survey$Height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   150.0   167.0   171.0   172.2   179.0   200.0
  1. For variable “M.I” It is a Categorical variable so we will use Mode.
survey$M.I[is.na(survey$M.I)] <- "Metric"
summary(survey$M.I)
## Imperial   Metric 
##       68      169
summary(survey)
##      Sex          Wr.Hnd          NW.Hnd        W.Hnd          Fold    
##  Female:119   Min.   :13.00   Min.   :12.50   Left : 18   L on R : 99  
##  Male  :118   1st Qu.:17.50   1st Qu.:17.50   Right:219   Neither: 18  
##               Median :18.50   Median :18.50               R on L :120  
##               Mean   :18.67   Mean   :18.58                            
##               3rd Qu.:19.80   3rd Qu.:19.70                            
##               Max.   :23.20   Max.   :23.50                            
##      Pulse             Clap       Exer       Smoke         Height     
##  Min.   : 35.00   Left   : 39   Freq:115   Heavy: 11   Min.   :150.0  
##  1st Qu.: 68.00   Neither: 50   None: 24   Never:190   1st Qu.:167.0  
##  Median : 74.15   Right  :148   Some: 98   Occas: 19   Median :171.0  
##  Mean   : 74.15                            Regul: 17   Mean   :172.2  
##  3rd Qu.: 80.00                                        3rd Qu.:179.0  
##  Max.   :104.00                                        Max.   :200.0  
##        M.I           Age       
##  Imperial: 68   Min.   :16.75  
##  Metric  :169   1st Qu.:17.67  
##                 Median :18.58  
##                 Mean   :20.37  
##                 3rd Qu.:20.17  
##                 Max.   :73.00

By following the above process we raplaced all the missing values with some meaningful data, as you can see in the “summary()” of data there are no “NA”s. Now our data is much cleaner than the one we started with, and it looks good for further analysis. Now the next thing that we need to handle is “Outlier”

Outliers

Outliers are the data values which deviate a lot from most of the data values, either too big or too small in comparison to average values. Let us understand using an example…We will use “boxplot()” for finding the outliers.

#We will create it only for numeric variables not for categorical data

boxplot(survey$Wr.Hnd, main="Wr.Hnd")

 

1_box_wr_hnd

boxplot(survey$NW.Hnd, main="NW.Hnd")

 

1_box_nw_hnd

boxplot(survey$Pulse, main="Pulse")

 

1_box_pulse.png

boxplot(survey$Height, main="Height")

 

1_box_height

boxplot(survey$Age, main="Age")

 

1_box_age

From above boxplot of “survey” data-set we can see that for variables Wr.Hnd, NW.Hnd, Pulse and Age has got some dots or circles in the graph, either above or below the slim lines (called whiskers, 1st and 3rd Quartile of boxplot). Those circles represents the presence of outliers and number of outliers OR we can detect outliers in one more way, that is, all the values out of range of 5th and 95th percentile can be considered as Outlier.

Removing Outliers

We will use the second method for removing the outliers, because in case of boxplot we may endup loosing some crucial information too, as range being produced by the 1st and 3rd quartile is generally smaller in comparison to the range being produced by 5th and 95th percentile. For example:

  1. Range generated by Quartiles
#For variable PULSE
## Range generated by Quartiles
## (-1.5 * IQR) to (1.5 * IQR), where (Inerquartile Range) IQR <- Q3 -Q1
#Q1 and Q3 are present in summary of Pulse, that we obtained at the beginning

Q3 <- 80
Q1 <- 66
IQR <- Q3 -Q1 
lower <- -1.5 * IQR
upper <- 1.5 * IQR

lower
## [1] -21
upper
## [1] 21
  1. Range generated by 5th and 95th Percentiles
## Range generated by 5th and 95th Percentiles
## For this we use quantile(, ) method

lower <- quantile(survey$Pulse, 0.05, na.rm=TRUE)
upper <- quantile(survey$Pulse, 0.95, na.rm=TRUE)

lower
## 5% 
## 60
upper
## 95% 
##  92

So we can see the significant difference in the range of values. Thus we will be useing Percentile approach for capping the outliers. ** Another thing to be noted is that in case of Percentiles we are using level of significance, so per our requirement we can change it too. In general we use 5th and 95th percentiles.

After detecting outliers for a variable we will replace it with the Mean of the data. We will do the same procedure for all the variables containing outliers.

  1. For variable Pulse
quantile(survey$Pulse, 0.05, na.rm=TRUE)
## 5% 
## 60
quantile(survey$Pulse, 0.95, na.rm=TRUE)
## 95% 
##  92
survey$Pulse <- ifelse(survey$Pulse > 92, 74.22, survey$Pulse)
survey$Pulse <- ifelse(survey$Pulse < 60, 74.22, survey$Pulse)

summary(survey$Pulse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60.00   70.00   74.15   74.25   78.00   92.00
boxplot(survey$Pulse, main="Pulse_Without_Outliers")

2_box_pulse

** So what happened here?? We are still able to see dots in boxplot!! Answer to this is though there are dots in boxplot, but we cann’t remove more data than that, because it will result in forging data, which will hamper the correct analysis of data. This happens a lot when you are working on large dataset, so keep in mind that not all the outliers can be removed

** Though if we don’t want them we can simply delete the data but it will result in data loss. So its not a good practice to delete them.

Now we will repeat same process for other three variables too.

  1. For variable Wr.Hnd
quantile(survey$Wr.Hnd, 0.05, na.rm=TRUE)
## 5% 
## 16
survey$Wr.Hnd <- ifelse(survey$Wr.Hnd < 16, 18.84, survey$Wr.Hnd)

summary(survey$Wr.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   17.50   18.60   18.84   19.80   23.20
boxplot(survey$Wr.Hnd, main="Wr.Hnd_Without_Outliers")

 

2_box_wr_hnd

  1. For variable NW.Hnd
quantile(survey$NW.Hnd, 0.05, na.rm=TRUE)
##   5% 
## 15.5
quantile(survey$NW.Hnd, 0.95, na.rm=TRUE)
##   95% 
## 22.22
survey$NW.Hnd <- ifelse(survey$NW.Hnd > 22.22, 18.53, survey$NW.Hnd)
survey$NW.Hnd <- ifelse(survey$NW.Hnd < 15.5, 18.53, survey$NW.Hnd)

summary(survey$NW.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.50   17.50   18.50   18.52   19.50   22.20
boxplot(survey$NW.Hnd, main="NW.Hnd_Without_Outliers")

 

2_box_nw_hnd.png

  1. For variable Age
quantile(survey$Age, 0.95, na.rm=TRUE)
##     95% 
## 30.6836
survey$Age <- ifelse(survey$Age > 30.6836, 19.22, survey$Age)

summary(survey$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.75   17.67   18.58   19.17   19.83   30.67
boxplot(survey$Age, main="Age_Without_Outliers")

 

2_box_age

This is the same scenario that we faced while removing outliers from variable Pulse, but here the number of outliers is high. So in this scenario you can omit the data containing outliers for better results.

That’s all for this lecture, in next lecture we look into the things in a much deeper way.

Keep Learning!!!

2 thoughts on “Data Cleaning

Leave a Reply

Back To Top
%d bloggers like this: