Linear Regression steps
Hi MLEnthusiasts!
Today, we will discuss the steps to perform linear regression. In subsequent articles, we will work through them by means of case studies!
 Understand the problem statement.
 Generate the hypothesis
Making the linear regression model

Step 1: Data Preparation
 Import data
 View data
 Check the summary of data
 If any variables have missing values, impute them.
 Missing value imputation: first look at the histogram of each variable.
 If the histogram shows a normal distribution, replace the missing values with the mean (continuous variables only).
 If it shows a skewed distribution, replace the missing values with the median (continuous variables only).
 If the variable is categorical, impute with the mode.
 We can also build a predictive model to impute missing data in a variable. Here, we treat the variable with missing data as the target variable and the other variables as predictors. We divide the data into two sets: one in which that variable has no missing values and one in which it is missing. The former is used as the training set to build the predictive model, which is then applied to the latter to predict the missing values.
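
As a rough illustration of the mean/median/mode rules above, here is a minimal sketch using pandas. The helper name impute_simple, the toy columns and the skewness cut-off of 1.0 are all assumptions made for this example; in practice you would judge the shape from the histogram as described.

```python
import pandas as pd

def impute_simple(df, skew_threshold=1.0):
    """Fill missing values: mean or median for numeric columns, mode otherwise.

    skew_threshold is an assumed cut-off standing in for eyeballing a histogram.
    """
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            # Roughly symmetric -> mean; clearly skewed -> median.
            skewed = abs(out[col].skew()) > skew_threshold
            fill = out[col].median() if skewed else out[col].mean()
        else:
            # Categorical -> most frequent value.
            fill = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill)
    return out

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "age": [20.0, 22.0, 24.0, None, 26.0, 28.0],       # roughly symmetric
    "income": [30.0, 32.0, 31.0, 29.0, None, 400.0],   # right-skewed
    "city": ["Pune", "Delhi", "Pune", None, "Pune", "Delhi"],
})
clean = impute_simple(df)
print(clean)
```

Here the missing age is filled with the mean, the missing income with the median, and the missing city with the mode.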

Step 2: Data Manipulation
 For categorical independent variables, many types of encoding are available. People often choose dummy variable creation for ordinal variables, i.e., categorical variables with some order. For a categorical variable with n categories, there should be n-1 dummy variables, since the omitted category is absorbed by the intercept.
 Another type of encoding is one-hot encoding, in which each category of a categorical variable is converted into a new binary column (1/0).
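
The difference between the two encodings can be seen in a small sketch with pandas get_dummies. The colour column is a made-up example, not data from this article.

```python
import pandas as pd

# Hypothetical categorical column with n = 3 categories.
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category (n columns).
one_hot = pd.get_dummies(df, columns=["colour"], dtype=int)

# Dummy encoding: drop the first category (n - 1 columns); the dropped
# category becomes the baseline that the intercept accounts for.
dummies = pd.get_dummies(df, columns=["colour"], drop_first=True, dtype=int)

print(one_hot)
print(dummies)
```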

Step 3: Univariate Analysis
 Now, treat outliers in each independent variable as well as in the dependent variable. Outliers can distort the coefficient estimates and degrade the model. For outlier treatment, one can choose the threshold as +/- 4 percentile, since it doesn’t lead to severe loss of data.
 Check the distribution of each of the independent variables. If it’s skewed, it should be transformed. If it’s normal, no transformation is required!
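
A possible sketch of both univariate checks with NumPy follows. Capping at the 4th and 96th percentiles is only one reading of the threshold mentioned above and is an assumption here, as are the function names and the synthetic income data.

```python
import numpy as np

def cap_outliers(x, lower_pct=4.0, upper_pct=96.0):
    """Clip values to the given percentiles (cut-offs are assumed, not prescribed)."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def skewness(x):
    """Moment-based sample skewness (about 0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

# Synthetic right-skewed variable, invented for illustration.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=1.0, size=1000)

capped = cap_outliers(income)
before = skewness(income)
after = skewness(np.log(income))  # the log of lognormal data is normal

print(f"skewness before: {before:.2f}, after log transform: {after:.2f}")
```

Capping keeps every row and only pulls extreme values in to the cut-offs, so no observations are lost.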

Step 4: Bivariate Analysis
 Now, examine the relationship between the dependent variable and each of the independent variables, one by one. A scatter plot is the best means for that, since it shows the form of the relationship. If the relationship is linear, good. If it’s curvilinear, try a log transformation!
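
A scatter plot is the visual check. As a numeric stand-in, the sketch below compares the Pearson correlation before and after log-transforming a predictor. The data is synthetic and built to be curvilinear, so the numbers only illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: y depends on log(x), so y is curvilinear in x.
x = rng.uniform(1.0, 100.0, size=500)
y = 3.0 * np.log(x) + rng.normal(0.0, 0.2, size=500)

r_raw = np.corrcoef(x, y)[0, 1]          # linear association with raw x
r_log = np.corrcoef(np.log(x), y)[0, 1]  # after log-transforming x

print(f"r with x: {r_raw:.3f}, r with log(x): {r_log:.3f}")
```

A clearly higher correlation after the transformation suggests that the transformed variable is the better candidate for a linear model.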

Step 5: Linear Regression Analysis
 Do a correlation analysis first. See whether the independent variables are highly correlated with each other. If they are, it can lead to multicollinearity! They should be highly correlated with the dependent variable but not among themselves!
 Check the multicollinearity: use the variance inflation factor (vif function) for this. It only quantifies the multicollinearity of each variable. If it’s less than 5 for a variable, keep it in the model. Otherwise, exclude it from the model.
 Now, fit a model on the new list of variables, and check the VIF again!
 If you want to skip these manual steps, use the step function directly. It performs forward/backward stepwise selection and tends to take care of multicollinearity as well.
 Discard the variables with a p-value higher than 0.05, since they are the least significant variables.
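
To make the VIF check concrete, here is a minimal NumPy sketch that computes VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The function name vif and the synthetic predictors a, b and c are assumptions for this example. It is not the vif function mentioned above, just the same quantity written out.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)

scores = vif(np.column_stack([a, b, c]))
keep = scores < 5  # threshold from the text

print(scores.round(2), keep)
```

In this toy case both a and c get a high VIF, so one would drop one of them, refit, and compute the VIF again.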

Step 6: Model Evaluation
 Evaluation metrics: Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.
 After getting the final model from the previous steps, check whether the model satisfies the assumptions of the linear model.
 Assumptions of Linear Regression:
 The relationship between the independent and dependent variables must be linear. Check this by means of the scatterplots.
 The residuals should be normally distributed. Residuals = observed value - predicted (fitted) value.
 Multicollinearity should not be present. Calculate the VIF to get an indication of the multicollinearity.
 Homoscedasticity must be present, i.e., the spread of the residuals should be uniform across the predicted (fitted) values.
 Checking if the model satisfies the assumptions
 Autocorrelation test: use the Durbin-Watson test on the model.
 The Durbin-Watson (DW) statistic is a test for autocorrelation in a data set. Here it is applied to the residuals of the model.
 The DW statistic always has a value between 0 and 4.0.
 A value of 2.0 means there is no autocorrelation detected in the sample. Values below 2.0 indicate positive autocorrelation and values above 2.0 indicate negative autocorrelation.
 Autocorrelation can be useful in technical analysis, which is most concerned with the trends of security prices using charting techniques in lieu of a company’s financial health or management.
 For our purposes, a value close to 2 is preferred.
 DW ≈ 2(1 - r), where r is the lag-1 autocorrelation of the residuals.
 Checking normality of errors: look at the histogram of the residuals, which should be normally distributed. This can also be checked with a normal Q-Q plot. If the points lie exactly on the line, the distribution is perfectly normal. Some deviation is to be expected, particularly near the ends, but the deviations should be small.
 Homoscedasticity: check the scatterplot of the residuals against the predicted (fitted) values. The spread should be uniform.
 Check Cook’s distance. Observations with high Cook’s distance values should be removed and the model refitted.
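
The evaluation metrics and the Durbin-Watson statistic are simple enough to write out by hand. The sketch below fits an ordinary least squares line with NumPy on synthetic data and then computes MAE, MSE, RMSE and DW = sum((e_t - e_{t-1})^2) / sum(e_t^2) on the residuals. The function names and the data are assumptions made for this illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Mean absolute error, mean squared error and root mean squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = np.mean(err ** 2)
    return {"mae": np.mean(np.abs(err)), "mse": mse, "rmse": np.sqrt(mse)}

def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Synthetic data with independent errors, invented for illustration.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=400)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=400)

# Ordinary least squares fit with an intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted  # observed minus fitted

metrics = regression_metrics(y, fitted)
dw = durbin_watson(residuals)

print(metrics)
print(f"Durbin-Watson: {dw:.2f}")
```

Because the errors here are generated independently, the statistic should land near 2. The same residuals array is what you would pass to a histogram or Q-Q plot for the normality check.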

Step 7: Validating the model
 K-fold cross-validation: calculate the average of the k recorded errors, also known as the cross-validation error. It serves as a performance metric for the model.
 Using regularized regression models: to handle correlated independent variables well and to overcome overfitting.

 Ridge penalty shrinks the coefficients of correlated predictors towards each other.
 Lasso tends to pick one of a pair of correlated features and discard the other.
 The tuning parameter lambda controls the strength of the penalty.

 Using random forest regression to carry out regression.
 Boosting: to improve the accuracy of the model.
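
K-fold cross-validation and the role of lambda can be sketched together. The example below uses a closed-form ridge fit in NumPy and averages the k fold errors for a few values of lambda. The function names, the synthetic data and the candidate lambda values are all assumptions for this illustration, and a library implementation would normally be used instead.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def kfold_cv_error(X, y, lam, k=5, seed=0):
    """Average mean squared error over k held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, intercept = ridge_fit(X[train], y[train], lam)
        pred = X[test] @ coef + intercept
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data, invented for illustration.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=200)

# lam = 0 is ordinary least squares; larger values shrink the coefficients more.
cv_errors = {lam: kfold_cv_error(X, y, lam) for lam in (0.0, 1.0, 100.0)}
print(cv_errors)
```

The lambda with the lowest cross-validation error would be the one to keep. On this well-behaved toy data, heavy shrinkage only hurts, which is the expected outcome when there is no overfitting to correct.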

Step 8: Prediction by linear regression model!
More articles on Linear regression – Linear regression using Python, Mathematics behind Univariate Linear Regression and Multivariate Linear Regression with R.