Linear Regression using R

Hi Everyone! Today, we will learn about building the linear regression model in R using a dataset from kaggle. The dataset is divided into two parts, one is train.csv and the other one is test.csv. We will make our model using train.csv and then test it using test.csv. I am using R notebook in the blog post and the data set has two variables, x and y. x  being independent variable and y being dependent variable.

#Implementing linear regression in R
data <- read.csv("C:/Users/jyoti/Downloads/train.csv")
summary(data)
##        x                 y         
##  Min.   :   0.00   Min.   : -3.84  
##  1st Qu.:  25.00   1st Qu.: 24.93  
##  Median :  49.00   Median : 48.97  
##  Mean   :  54.99   Mean   : 49.94  
##  3rd Qu.:  75.00   3rd Qu.: 74.93  
##  Max.   :3530.16   Max.   :108.87  
##                    NA's   :1

R shows six-point summaries of the variables in the dataframe. min is the minimum value, 1st Qua is the 25%ile value, median is the 50%ile value, 3dr Qua is the 75%ile value, max is the maximum value and mean is the mean of all the observations of the corresponding variable. The summary also shows NA’s or number of missing observations of a particular variable. In x, there is no missing variable but in y, there is 1 missing variable. Next step is to find the correlation between two variables, if any, between x and y. This can be done by using cor() function in R. If you try to find the correlation of two variables having missing values using the cor() function, you will get following as the output:

cor(data)
##    x  y
## x  1 NA
## y NA  1

Because y has 1 missing value, the cor function is giving na as the output. To resolve this, we need to do mean imputation and then find if theere still are any missing values or not using summary function again.

data$y[is.na(data$y)] = 49.94
summary(data)
##        x                 y         
##  Min.   :   0.00   Min.   : -3.84  
##  1st Qu.:  25.00   1st Qu.: 24.99  
##  Median :  49.00   Median : 49.10  
##  Mean   :  54.99   Mean   : 49.94  
##  3rd Qu.:  75.00   3rd Qu.: 74.88  
##  Max.   :3530.16   Max.   :108.87

Thus, there are no missing values now. Let’s apply cor function now.

cor(data)
##           x         y
## x 1.0000000 0.2138303
## y 0.2138303 1.0000000

The output shows that x and y are only 21.38% correlated. Let’s do some visual analysis now.Let’s use par function and generate two plots of x and y and show then in 1 row and two columns.

par(mfrow = c(1,2))
hist(data$x, main = "distribution of x")
hist(data$y, main = "distribution of y")

Thus, the plot of y is approx normally distributed, but not of x. Let’s generate boxplot of x and y now.

box1 = boxplot(data$x)

This shows that there are outliers in the x variable. Also, let’s generate the boxplot of y.

box2 = boxplot(data$y)

Thus, y doesn’t have enough outliers. Now, we will need to handle the outliers because presence of outliers may not give us the desired result! Let’s gerate quantile analysis of x.

quantile(data$x, seq(0, 1, 0.02))
##       0%       2%       4%       6%       8%      10%      12%      14% 
##    0.000    2.000    4.960    7.000    8.000   10.000   13.000   15.860 
##      16%      18%      20%      22%      24%      26%      28%      30% 
##   17.000   19.000   21.000   23.780   25.000   26.000   27.000   29.700 
##      32%      34%      36%      38%      40%      42%      44%      46% 
##   31.680   34.000   36.000   38.000   41.000   42.000   44.560   47.000 
##      48%      50%      52%      54%      56%      58%      60%      62% 
##   48.000   49.000   51.000   52.000   54.000   55.420   58.000   59.000 
##      64%      66%      68%      70%      72%      74%      76%      78% 
##   61.000   64.340   67.000   70.000   71.000   74.000   75.240   78.000 
##      80%      82%      84%      86%      88%      90%      92%      94% 
##   81.000   84.000   86.000   88.000   89.120   91.000   93.000   95.000 
##      96%      98%     100% 
##   98.000   99.000 3530.157

We can see that the 100% value is the main outlier in case of x. Let’s handle this by generating a new dataframe.

finalData = data
finalData$x = ifelse(data$x > 99, 99, data$x)

Now, let’s do the same for y.

quantile(data$y, seq(0, 1, 0.02))
##         0%         2%         4%         6%         8%        10% 
##  -3.839981   1.400143   4.027188   6.814947   8.445068  10.264419 
##        12%        14%        16%        18%        20%        22% 
##  12.109471  14.094909  17.317910  19.404504  21.197887  22.537589 
##        24%        26%        28%        30%        32%        34% 
##  24.511020  25.690335  27.199616  29.151251  31.762465  33.670006 
##        36%        38%        40%        42%        44%        46% 
##  36.302391  37.980422  41.095325  43.222012  45.054610  46.912769 
##        48%        50%        52%        54%        56%        58% 
##  48.000888  49.095828  50.120910  52.376437  53.592696  55.658733 
##        60%        62%        64%        66%        68%        70% 
##  57.273872  59.193245  61.418188  63.559271  66.347003  69.898164 
##        72%        74%        76%        78%        80%        82% 
##  71.700767  73.746075  75.523592  78.242993  80.933215  83.555818 
##        84%        86%        88%        90%        92%        94% 
##  86.107227  86.987577  88.885099  91.170832  92.890793  94.641859 
##        96%        98%       100% 
##  96.196669  99.406205 108.871618

We can see that there is no enough presence of outliers. Now, let’s find the correlation between x and y of finalData.

cor(finalData)
##           x         y
## x 1.0000000 0.9932878
## y 0.9932878 1.0000000

As you can see, after missing value imputation and handling of outliers, we get 99.328% positive correlation between x and y. Let’s verify this by generating a scatter plot with x as independent variable and y as dependent variable.

plot(finalData$x, finalData$y, main = "Scatter plot of x and y")

Thus, x and y are positively correlated or as x increases, y also increases. Let’s now make a model.

model = lm(y~x, data = finalData)
summary(model)
## 
## Call:
## lm(formula = y ~ x, data = finalData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.797  -1.960   0.101   1.922  10.135 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.003116   0.254250   0.012     0.99    
## x           0.997310   0.004396 226.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.367 on 698 degrees of freedom
## Multiple R-squared:  0.9866, Adjusted R-squared:  0.9866 
## F-statistic: 5.147e+04 on 1 and 698 DF,  p-value: < 2.2e-16

The equation we get is y = 0.997310*x + 0.003116. 98.66% of R-squared implies that 98.66% variation in y is explained by x. Intercept of 0.003116 says that there is only this much about which can’t be explained and this gives our model as highly reliable. Let’s now see if the residuals satisfy the assumption of linear regression.

library(car)
## Loading required package: carData
durbinWatsonTest(model)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.0105545      2.019993   0.716
##  Alternative hypothesis: rho != 0

As you can see, D-W statistic value is 2.019993 which states that there is no autocorrelation among the residuals. D-W value lies between 0 to 4. Value of 2 states no autocorrelation, less than 2 states positive autocorrelation and more than 2 states negative autocorrelation. Let’s now see if the residuals are normally distributed or not.

hist(residuals(model))

The residuals are approx normally distributed. Now see if the distribution between y and residuals are uniformly distributed or not.

plot(finalData$y, residuals(model))

The distribution is uniform in nature. Thus, all the assumptions of linear regression are satisfied. Let’s now use this model to predict the values in the test.csv.

newData = read.csv("C:/Users/jyoti/Downloads/test.csv")
newData$predicted = predict(model, newData)
summary(newData)
##        x                y             predicted       
##  Min.   :  0.00   Min.   : -3.468   Min.   : 0.00312  
##  1st Qu.: 27.00   1st Qu.: 25.677   1st Qu.:26.93050  
##  Median : 53.00   Median : 52.171   Median :52.86056  
##  Mean   : 50.94   Mean   : 51.205   Mean   :50.80278  
##  3rd Qu.: 73.00   3rd Qu.: 74.303   3rd Qu.:72.80677  
##  Max.   :100.00   Max.   :105.592   Max.   :99.73415

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s