Hi everyone! Today we will learn about ridge regression, the mathematics behind it, and how to implement it in Python!
To build a solid foundation, let’s first go through a few points:
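- First, let us look at the sum of squared errors function, which is defined as $E = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, where $y_i$ is the observed target and $\hat{y}_i$ is the model’s prediction for the i-th sample.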
- The first requirement for any data set used to build machine learning models is that the data points are randomly sampled and that the data set is large.
- This requirement is not always met. In some cases, the number of features/dimensions (D) is greater than the number of samples/observations (N). The data set is then fat (D >> N) rather than skinny (D << N).
- Note that adding a feature made of completely random noise can still improve R squared on the training data. This is undesirable: we don’t want noise or irrelevant features to alter our outputs. Regularization helps prevent this.
- With L2 regularization, the weights are kept small, so the effect of noise is reduced.
- We assume that a data set which is random in nature follows a Gaussian distribution.
- Any Gaussian distribution is defined by its mean µ and variance σ², and is represented by $X \sim \mathcal{N}(\mu, \sigma^2)$, where X is the input matrix.
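- For any point $x_i$, the probability (density) of $x_i$ is given by the expression $p(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$.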
- Also, because each $x_i$ occurs independently of the others, their joint probability is the product $p(x_1, \dots, x_N) = \prod_{i=1}^{N} p(x_i)$.
- The linear regression solution is the maximum likelihood estimate of the line of best fit.
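- Now the question arises: what is likelihood? Likelihood is the probability of the data X given a parameter of interest, which in our case is µ. So the likelihood function is defined as $L(\mu) = p(X \mid \mu) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$.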
- Linear regression maximizes this function to find the line of best fit. We find the value of µ that maximizes the function, and can then say that our data very likely came from a population with mean µ.
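- To solve this, we first take the natural log of the likelihood function L, then differentiate ln L with respect to µ and set the derivative to zero: $\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (x_i - \mu) = 0$, which gives $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$.
Hence, at this value of µ, the likelihood function is maximized.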
- Note that maximizing the likelihood function L is equivalent to minimizing the error function E. Also, y is Gaussian distributed with mean $w^T x$ and variance $\sigma^2$, that is, $y \sim \mathcal{N}(w^T x, \sigma^2)$
or
$y = w^T x + \varepsilon$,
where ε is Gaussian distributed noise with zero mean and variance $\sigma^2$.
- This is equivalent to saying that in linear regression, errors are Gaussian and the trend is linear.
- Now, let’s understand why ridge regression, or L2 regularization, was introduced. The answer is outliers! In the presence of outliers, linear regression pulls the line of best fit away from the real trend. Because it follows the method of least squares, it bends the trend line towards the outliers in order to minimize the squared error. This makes the predictions less accurate than they would be without the outliers. L2 regularization, or ridge regression, is one way to handle this problem.
- To compensate for this, we modify the cost function and penalize large weights in the following way: $J = \sum_{i=1}^{N} (y_i - w^T x_i)^2 + \lambda w^T w$, where λ is the L2 regularization parameter.
- Minimizing the plain squared error maximizes the likelihood, as shown above. But now that the cost function has two terms, this no longer holds. We now have two probabilities: the likelihood and the prior.
- Note: P(Y|X, w) is the likelihood, not the posterior.
- P(w) is called the prior because it represents our prior beliefs about w. Thus, J is proportional to -ln(P(Y|X, w)) - ln(P(w)). Also, by Bayes’ rule, P(w|Y, X) is proportional to P(Y|X, w)*P(w). P(w|Y, X) is called the posterior probability. The method of maximizing P(w|Y, X) is called maximum a posteriori (MAP) estimation.
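- This method encourages the weights to be small, as P(w) is a Gaussian centered around 0. Setting the gradient of J with respect to w to zero gives $w = (\lambda I + X^T X)^{-1} X^T Y$, where λ is the L2 regularization parameter. This value of w is called the MAP estimate of w.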
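As a sketch of where the MAP estimate comes from, the ridge cost can be minimized in closed form. The notation here is an assumption made for illustration: X is the N×D input matrix, Y is the vector of targets, and λ is the L2 regularization parameter.

```latex
\begin{aligned}
J(w) &= (Y - Xw)^T (Y - Xw) + \lambda\, w^T w \\
\frac{\partial J}{\partial w} &= -2 X^T (Y - Xw) + 2 \lambda w = 0 \\
(X^T X + \lambda I)\, w &= X^T Y \\
w_{\text{MAP}} &= (X^T X + \lambda I)^{-1} X^T Y
\end{aligned}
```

Setting λ = 0 recovers the maximum likelihood (least squares) solution, which is why the two estimates differ only by the λI term.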
Now, let’s see how to implement Ridge regression or L2 regularization in Python.
import numpy as np #importing the numpy package with alias np
import matplotlib.pyplot as plt #importing the matplotlib.pyplot as plt
No_of_observations = 50 #Setting number of observation = 50
X_input = np.linspace(0,10,No_of_observations) #Generating 50 equally-spaced data points from 0 to 10
Y_output = 0.5*X_input + np.random.randn(No_of_observations) #setting Y_output[i] = 0.5*X_input[i] + standard normal noise
Y_output[-1]+=30 #adding 30 to the last element of Y_output to create an outlier
Y_output[-2]+=30 #adding 30 to the second-last element of Y_output to create another outlier
plt.scatter(X_input, Y_output)
plt.title('Relationship between Y and X[:, 1]')
plt.xlabel('X[:, 1]')
plt.ylabel('Y')
plt.show()
X_input = np.vstack([np.ones(No_of_observations), X_input]).T #prepending a column of ones (bias term) to X
w_maxLikelihood = np.linalg.solve(np.dot(X_input.T, X_input), np.dot(X_input.T, Y_output)) #solving the normal equations for the maximum likelihood weights
Y_maxLikelihood = np.dot(X_input, w_maxLikelihood) #Finding predicted Y corresponding to w_maxLikelihood
plt.scatter(X_input[:,1], Y_output)
plt.plot(X_input[:,1],Y_maxLikelihood, color='red')
plt.title('Graph of maximum likelihood method(Red line: predictions)')
plt.xlabel('X[:, 1]')
plt.ylabel('Y')
plt.show()
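As a quick sanity check on the maximum likelihood step, the normal-equation solution can be compared against NumPy's least-squares solver. This is a minimal sketch rather than part of the original notebook: the fixed seed and the short variable names (`x`, `y`, `X`, `w_normal`, `w_lstsq`) are assumptions made so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
N = 50
x = np.linspace(0, 10, N)
y = 0.5 * x + rng.standard_normal(N)   # same trend as the tutorial data
X = np.vstack([np.ones(N), x]).T       # column of ones for the bias term

# Normal equations: (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver; rcond=None selects NumPy's default cutoff
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", w_normal)
print("lstsq:           ", w_lstsq)
```

Both approaches minimize the same squared error, so the two weight vectors should agree up to floating-point error.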
L2_coeff = 1000 #setting L2 regularization parameter to 1000
w_maxAPosterior = np.linalg.solve(np.dot(X_input.T, X_input)+L2_coeff*np.eye(2), np.dot(X_input.T, Y_output)) #Finding weights for MAP estimation
Y_maxAPosterior = np.dot(X_input, w_maxAPosterior) #Finding predicted Y corresponding to w_maxAPosterior
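To see how the L2 coefficient shrinks the weights, the same closed-form solve can be repeated for several values of the coefficient. This is a sketch under stated assumptions: `ridge_weights` is a hypothetical helper name, the seed and the list of coefficients are arbitrary choices, and the bias weight is penalized together with the slope, as in the code above.

```python
import numpy as np

def ridge_weights(X, y, l2_coeff):
    """Closed-form ridge / MAP weights: solve (X^T X + l2_coeff * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + l2_coeff * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
N = 50
x = np.linspace(0, 10, N)
y = 0.5 * x + rng.standard_normal(N)
y[-2:] += 30                           # two outliers, as in the tutorial
X = np.vstack([np.ones(N), x]).T

l2_values = [0, 1, 10, 100, 1000, 10000]  # illustrative values only
norms = []
for l2 in l2_values:
    w = ridge_weights(X, y, l2)
    norms.append(np.linalg.norm(w))
    print(f"L2_coeff={l2:>6}: w={w}, ||w||={norms[-1]:.4f}")
```

The norm of the weight vector does not increase as the coefficient grows, and a coefficient of 0 reproduces the maximum likelihood weights.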
plt.scatter(X_input[:,1], Y_output)
plt.plot(X_input[:,1],Y_maxLikelihood, color='red',label="maximum likelihood")
plt.plot(X_input[:,1],Y_maxAPosterior, color='green', label="map")
plt.title('Graph of MAP v/s ML method')
plt.legend()
plt.xlabel('X[:, 1]')
plt.ylabel('Y')
plt.show()
Thus, the green line (MAP) follows the trend well and doesn’t bend towards the outliers, while the red line (ML) fails to do so.
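The visual comparison can also be checked numerically by comparing each fitted slope with the slope of 0.5 used to generate the data. This is a minimal sketch: the seed and the names `w_ml`, `w_map`, `err_ml` and `err_map` are assumptions introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (an assumption) for repeatability
N = 50
true_slope = 0.5
x = np.linspace(0, 10, N)
y = true_slope * x + rng.standard_normal(N)
y[-2:] += 30                           # the two outliers
X = np.vstack([np.ones(N), x]).T

# Maximum likelihood (plain least squares) weights
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
# MAP (ridge) weights with the tutorial's L2 coefficient of 1000
w_map = np.linalg.solve(X.T @ X + 1000 * np.eye(2), X.T @ y)

err_ml = abs(w_ml[1] - true_slope)
err_map = abs(w_map[1] - true_slope)
print(f"ML slope:  {w_ml[1]:.3f} (error {err_ml:.3f})")
print(f"MAP slope: {w_map[1]:.3f} (error {err_map:.3f})")
```

On this data set the MAP slope lands closer to the true slope than the ML slope. How close it gets depends on the chosen coefficient and on the data, so the value of 1000 should be treated as a setting to tune rather than a general recommendation.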
So, guys, with this I conclude this tutorial. In the next tutorial, I will talk about L1 regularization or Lasso Regression. Stay tuned!