Linear Regression Using Python

Hi MLEnthusiasts! In this tutorial, we will learn how to implement linear regression, how to visualize our variables with plots, and how to compute R squared mathematically using Python.

Please note that this tutorial is based on our previous tutorial, “THE MATHEMATICS BEHIND UNIVARIATE LINEAR REGRESSION“, so it is highly recommended to go through that one first in order to understand this one.

The first step is to import Python libraries like numpy and matplotlib.

In [2]:
import numpy as np
import matplotlib.pyplot as plt 

It should be noted here that np and plt are used as aliases for numpy and matplotlib.pyplot, respectively.
The next step is to make empty vectors for X and Y where X is our input vector and Y is our output vector.

In [3]:
X_input = []           #Making empty list for storing input column
Y_output = []          #Making empty list for storing output column

Now, we will open our file using the following code:

In [5]:
for line in open("data.csv"):  #Reading data.csv line by line
    x, y = line.split(',')     #Splitting the line on the comma: ","
    X_input.append(float(x))   #Appending the float value of x to the input list
    Y_output.append(float(y))  #Appending the float value of y to the output list

This reads the .csv file line by line: for each line, the observation before ‘,’ is stored in x and the one after ‘,’ in y. Both values are then converted to float and appended to X_input and Y_output, so both lists end up holding float values, with X_input as the input vector and Y_output as the output vector.
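As an aside, if the file is well formed, NumPy can parse the two comma-separated columns in a single call. A minimal sketch, using an in-memory string in place of data.csv (the numbers are invented for illustration):

```python
import io
import numpy as np

# A small in-memory sample standing in for data.csv (hypothetical values)
sample = io.StringIO("1.0,4.9\n2.0,6.8\n3.0,8.7\n")

# genfromtxt splits each line on the delimiter and converts to float
data = np.genfromtxt(sample, delimiter=",")
X_input, Y_output = data[:, 0], data[:, 1]
```

With a real file, `np.genfromtxt("data.csv", delimiter=",")` would replace the manual loop and the float conversions in one step.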
The next step is to convert X_input and Y_output into numpy arrays, since numpy makes it very easy to do mathematical computations on its matrices and vectors.

In [6]:
X_input = np.array(X_input)    #Converting X_input list into numpy array
Y_output = np.array(Y_output)  #Converting Y_output list into numpy array

Now, let’s see what the relationship between X and Y looks like.

In [9]:
plt.scatter(X_input, Y_output)
plt.title('Scatter plot showing relationship between X and Y')
plt.xlabel('values of X')
plt.ylabel('values of Y')
plt.show()

From the above plot, we can see that the relationship is strongly linear. Now, let’s find out what the values of m and c are!

In [11]:
denominator = X_input.dot(X_input) - X_input.mean()*X_input.sum()
m = (X_input.dot(Y_output) - Y_output.mean()*X_input.sum())/denominator  #Slope of linear regression equation
c = (Y_output.mean()*X_input.dot(X_input) - X_input.mean()*X_input.dot(Y_output))/denominator  #Intercept of linear regression equation
print("Value of slope m is", m)
print("Value of intercept c is", c)
Value of slope m is 1.97261216748
Value of intercept c is 2.86442407566
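As a sanity check, the closed-form expressions above should agree with NumPy’s own least-squares fit, since np.polyfit with degree 1 solves the same minimization. A quick comparison on synthetic data (the values below are made up for illustration, not taken from data.csv):

```python
import numpy as np

# Synthetic data roughly following y = 2x + 3 (illustrative values only)
X_input = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y_output = np.array([5.1, 6.9, 9.2, 10.8, 13.1])

# Closed-form least-squares estimates, same formulas as in the tutorial
denominator = X_input.dot(X_input) - X_input.mean() * X_input.sum()
m = (X_input.dot(Y_output) - Y_output.mean() * X_input.sum()) / denominator
c = (Y_output.mean() * X_input.dot(X_input)
     - X_input.mean() * X_input.dot(Y_output)) / denominator

# np.polyfit(deg=1) fits the same line by least squares
m_ref, c_ref = np.polyfit(X_input, Y_output, 1)
```

Both approaches should produce the same slope and intercept up to floating-point tolerance.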

Let’s now find our predicted values of Y, or Yhat, using the linear regression equation.

In [13]:
Yhat = m*X_input + c

Let’s visualize how far our predictions are from the actual values.

In [18]:
plt.scatter(X_input,Y_output)
plt.title('Scatter plot showing relationship between X and Y')
plt.xlabel('values of X')
plt.ylabel('values of Y')
plt.plot(X_input, Yhat, color = 'red')
plt.show()

The dots are the actual values of Y, and the line of best fit shows Yhat, the predicted values of Y. We see that the predictions are very close to the actual values. Let’s now calculate the R square of this model.

In [17]:
diff1 = Y_output - Yhat  #The error term
diff2 = Y_output - Y_output.mean()
rsquared = 1 - (diff1.dot(diff1)/diff2.dot(diff2)) 
print("R square of the model is", rsquared)
R square of the model is 0.991183820298

The R square turns out to be very close to 1, so our model fits the data very well. It should be noted here that X.dot(X) equals the sum of the squares of Xi, where i ranges from 1 to N and N is the number of observations.
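A useful cross-check: for univariate linear regression with an intercept, R square is exactly the square of the Pearson correlation between X and Y. A short sketch on synthetic data (values invented for illustration, not from data.csv):

```python
import numpy as np

# Illustrative synthetic data
X_input = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y_output = np.array([5.1, 6.9, 9.2, 10.8, 13.1])

# Fit and predict exactly as in the tutorial
denominator = X_input.dot(X_input) - X_input.mean() * X_input.sum()
m = (X_input.dot(Y_output) - Y_output.mean() * X_input.sum()) / denominator
c = (Y_output.mean() * X_input.dot(X_input)
     - X_input.mean() * X_input.dot(Y_output)) / denominator
Yhat = m * X_input + c

diff1 = Y_output - Yhat             # residuals
diff2 = Y_output - Y_output.mean()  # deviations from the mean
rsquared = 1 - diff1.dot(diff1) / diff2.dot(diff2)

# Squared Pearson correlation between X and Y
r = np.corrcoef(X_input, Y_output)[0, 1]
```

If the two quantities disagree, something is wrong in the fit or in the R square computation, which makes this a handy debugging test.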
So guys, with this, we conclude our tutorial. Stay tuned for more interesting tutorials.
