
Linear Regression Steps


Hi MLEnthusiasts!

Today, we will discuss the steps to perform linear regression, and in subsequent articles we will work through them by means of case studies!

  • Understand the problem statement.
  • Generate the hypothesis

Making the linear regression model

  • Step 1: Data Preparation

    • Import data
    • View data
    • Check the summary of data
    • If the data variables have missing values, do missing value imputation
    • Missing value imputation: First look at the histogram of each of the data variables
      • If the histogram shows a normal distribution, replace the missing values with the mean (only for continuous variables)
      • If it shows a skewed distribution, replace the missing values with the median (only for continuous variables)
      • If the variable is categorical, do the imputation with the mode.
      • We can even build a predictive model to impute missing data in a variable. Here, we treat the variable having missing data as the target variable and the other variables as predictors. We divide the data into 2 sets – one without any missing values for that variable and the other with missing values for that variable. The former set is used as the training set to build the predictive model, which is then applied to the latter set to predict the missing values.
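Taken together, Step 1 might look like the short R sketch below. The file name data.csv and the columns age (numeric) and city (categorical) are placeholders for your own data, not anything fixed by the article.

```r
# Minimal data-preparation sketch; "data.csv", "age" and "city" are hypothetical.
df <- read.csv("data.csv", stringsAsFactors = TRUE)

head(df)      # view the data
summary(df)   # check ranges, classes and NA counts

hist(df$age)  # inspect the distribution before choosing an imputation rule

# Roughly normal histogram -> impute with the mean (continuous variable)
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)
# Skewed histogram -> impute with the median instead
# df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

# Categorical variable -> impute with the mode (most frequent level)
mode_city <- names(which.max(table(df$city)))
df$city[is.na(df$city)] <- mode_city
```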
  • Step 2: Data Manipulation

    • For categorical independent variables, one can go for many types of encodings – people often create dummy variables for ordinal variables, i.e., categorical variables with an inherent order. For a categorical variable having n categories, there should be n-1 dummy variables, since the left-out category is taken care of by the intercept.
    • There is one more type of encoding, one-hot encoding, in which each category of a categorical variable is converted into a new binary column (1/0).
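As a quick illustration of the two encodings in R (reusing the hypothetical df and city column from the earlier sketch): model.matrix() with an intercept produces the n-1 dummy columns, and dropping the intercept produces a full one-hot encoding. Note that lm() creates dummy variables automatically for factor columns, so explicit encoding is mainly needed for other modelling tools.

```r
# Dummy variables: a factor with n levels yields n-1 columns; the left-out
# level is absorbed by the intercept.
dummies <- model.matrix(~ city, data = df)
head(dummies)

# One-hot encoding: drop the intercept so every level gets its own 0/1 column.
one_hot <- model.matrix(~ city - 1, data = df)
head(one_hot)
```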
  • Step 3: Univariate Analysis

    • Now, do outlier treatment for each independent as well as dependent variable. Outliers distort the coefficient estimates and deteriorate the model. For outlier treatment, one can cap values at roughly the 4th and 96th percentiles, since this does not lead to severe loss of data.
    • Check the distribution of each of the independent variables. If it's skewed, it should be transformed (for example, with a log transformation). If it's roughly normal, no transformation is required!
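A small sketch of both ideas, again on the hypothetical age column; the 4th/96th percentile cut-offs simply follow the rule of thumb mentioned above.

```r
# Cap extreme values at the chosen percentiles
lower <- quantile(df$age, 0.04, na.rm = TRUE)
upper <- quantile(df$age, 0.96, na.rm = TRUE)
df$age <- pmin(pmax(df$age, lower), upper)

# For a skewed variable, a log transform often restores symmetry
# (log1p is safe when zeros are present)
df$age_log <- log1p(df$age)
hist(df$age_log)
```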
  • Step 4: Bivariate Analysis

    • Now, examine the relationship between the dependent variable and each of the independent variables one by one. A scatter plot is the best means for that, and it tells you about the shape of the relationship. If the relationship is linear, good; if it's curvilinear, go for a log transformation!
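For example, assuming a hypothetical target column price:

```r
# Scatter plot of the dependent variable against one predictor
plot(df$age, df$price, xlab = "age", ylab = "price")
abline(lm(price ~ age, data = df), col = "red")  # fitted line for reference

# If the relationship looks curvilinear, inspect it on a log scale instead:
# plot(log1p(df$age), df$price)
```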
  • Step 5: Linear Regression Analysis

    • Go for correlation analysis first. See whether the independent variables have high correlations among each other; if they do, it can lead to multi-collinearity! They should have a high correlation with the dependent variable but not among themselves!
    • Check the multi-collinearity: use the variance inflation factor (the vif function) for this. If the VIF is less than 5 for a variable, keep it in the model; otherwise exclude it from the model.
    • Now, fit a model on the new list of variables, and check the VIF again!
    • If you want to shorten this list of steps, use the step function directly. It performs forward/backward stepwise selection, which also helps weed out redundant (collinear) variables.
    • Discard the variables having a p-value higher than 0.05; they are the least significant variables. A sketch of this workflow is given below.
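Here is what that workflow might look like in R; the formula and the columns age, income and city are placeholders, and vif() comes from the car package.

```r
library(car)

# Correlation analysis among the numeric variables
numeric_cols <- sapply(df, is.numeric)
round(cor(df[, numeric_cols], use = "complete.obs"), 2)

# Fit a first model and check multicollinearity
fit <- lm(price ~ age + income + city, data = df)
vif(fit)                 # VIF > 5 suggests excluding that variable

fit2 <- lm(price ~ age + city, data = df)  # refit after dropping a high-VIF variable
vif(fit2)

# Or let AIC-based stepwise selection do the pruning in one call
fit_step <- step(fit, direction = "both", trace = FALSE)
summary(fit_step)        # drop remaining variables whose p-value exceeds 0.05
```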
  • Step 6: Model Evaluation

    • Evaluation metrics: Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.
    • After getting the final model from the previous steps, check whether or not the model satisfies the assumptions of the linear model.
    • Assumptions of Linear Regression:
      • The relationship between independent and dependent variables must be linear. Check this by means of the scatterplots.
      • The residuals should be normally distributed. Residuals = Observed value – Predicted value (Fitted value)
      • Multicollinearity should not be present. Calculate the VIF to get an indication of the multi-collinearity among the predictors.
      • Homoscedasticity must be present, i.e., the spread of the residuals should be uniform across the fitted (predicted) values.
    • Checking if the model satisfies the assumptions
      • Autocorrelation test: Use Durbin Watson Test on the model.
        • The Durbin-Watson statistic is a test for autocorrelation in the residuals of the regression.
        • The DW statistic always has a value between zero and 4.0.
        • A value of 2.0 means there is no autocorrelation detected in the sample. Values from zero to 2.0 indicate positive autocorrelation and values from 2.0 to 4.0 indicate negative autocorrelation.
        • Autocorrelation can be useful in technical analysis, which is most concerned with the trends of security prices using charting techniques in lieu of a company’s financial health or management.
        • For our purposes, a value close to 2 is preferred, since it indicates little or no autocorrelation in the residuals.
        • DW ≈ 2(1 - r), where r is the first-order autocorrelation of the residuals.
      • Checking normality of errors: look at the histogram of the residuals. Residuals should be normally distributed. This can also be checked by visualizing a normal Q-Q plot. If points lie exactly on the line, the distribution is perfectly normal. However, some deviation is to be expected, particularly near the ends, but the deviations should be small.
      • Homoscedasticity: check the scatterplot between the residuals and the fitted values. The spread should be uniform.
        • Check Cook's distance. Observations having high Cook's distance values should be removed and the model should be refitted. A sketch of these assumption checks is given below.
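These checks might look as follows on the final model (here fit_step from the earlier sketch); dwtest() is in the lmtest package.

```r
library(lmtest)

dwtest(fit_step)              # Durbin-Watson test; a statistic near 2 is what we want

res <- residuals(fit_step)
hist(res)                     # residuals should look roughly normal
qqnorm(res); qqline(res)      # points should hug the line, small deviations at the ends

plot(fitted(fit_step), res,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)        # spread should be uniform (homoscedasticity)

cd <- cooks.distance(fit_step)
which(cd > 4 / nrow(df))      # common rule of thumb for flagging influential points
```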
  • Step 7: Validating the model

    • K-fold cross-validation: calculate the average of the k recorded errors, also known as the cross-validation error. It serves as a performance metric for the model.
    • Using regularized regression models: to handle correlated independent variables well and to overcome overfitting.
        • The ridge penalty shrinks the coefficients of correlated predictors towards each other.
        • Lasso tends to pick one of a pair of correlated features and discard the other.
        • The tuning parameter lambda controls the strength of the penalty.
    • Using random forest regression to carry out the regression
    • Boosting: To improve the accuracy of the model
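The sketch below shows how these ideas are usually wired up in R with the caret, glmnet and randomForest packages; the formula and column names are still placeholders.

```r
library(caret)
library(glmnet)
library(randomForest)

# K-fold cross-validation of the linear model (error averaged over 10 folds)
ctrl  <- trainControl(method = "cv", number = 10)
cv_lm <- train(price ~ age + income + city, data = df, method = "lm", trControl = ctrl)
cv_lm$results

# Regularized regression: alpha = 0 is ridge, alpha = 1 is lasso;
# cv.glmnet chooses the penalty strength lambda by cross-validation
x <- model.matrix(price ~ age + income + city, data = df)[, -1]
y <- df$price
ridge <- cv.glmnet(x, y, alpha = 0)
lasso <- cv.glmnet(x, y, alpha = 1)

# Random forest regression (boosting, e.g. with the gbm package, is analogous)
rf <- randomForest(price ~ age + income + city, data = df)
```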
  • Step 8: Prediction by linear regression model!
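And prediction itself is a one-liner once the model is ready; new_df stands in for unseen data with the same columns as the training set.

```r
pred <- predict(fit_step, newdata = new_df)
head(pred)
```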

 

More articles on Linear regression – Linear regression using Python, Mathematics behind Univariate Linear Regression and Multivariate Linear Regression with R.

For YouTube tutorials, go to our channel.

 

