# Lasso Regression Using Python

Hi everyone! Today we will learn about Lasso regression (L1 regularization), the mathematics behind it, and how to implement Lasso regression using Python.

**Building foundation to implement Lasso Regression using Python**

#### Sum of squares function

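- First, let us look at the sum of squared errors function, which is defined as

  $$E = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

  where $y_i$ is the observed output and $\hat{y}_i$ is the predicted output for the $i$-th sample.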
- It is also important to note the first requirement for any data set that we want to use to build machine learning models: the data points should be randomly sampled and the data set should be large.
- But this requirement is not always met. In some cases, the number of features/dimensions (D) is greater than the number of samples/observations (N). The data set is then fat (D >> N) rather than skinny (D << N).
- Note that adding even completely random noise features can improve R squared on the training data. This is undesirable: we do not want noise or irrelevant features to alter our outputs. Regularization helps prevent this.
- With L1 regularization, a few weights, corresponding to the most important features, remain non-zero, and most of the others are driven to zero.
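
The claim about noise and R squared can be checked with a small sketch. It assumes a plain least-squares fit without an intercept; the helper `r_squared` and the synthetic data are made up for illustration and are not part of the original post.

```python
import numpy as np

def r_squared(X, y):
    """Training R^2 of a least-squares fit of y on X (no intercept term)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X.dot(w)
    total = y - y.mean()
    return 1.0 - residual.dot(residual) / total.dot(total)

rng = np.random.default_rng(0)
n_samples = 30

# One genuinely useful feature plus noisy targets (illustrative data).
X_signal = rng.normal(size=(n_samples, 1))
y = 2.0 * X_signal[:, 0] + rng.normal(scale=1.0, size=n_samples)

# Twenty extra columns of pure noise, unrelated to y.
X_noise = rng.normal(size=(n_samples, 20))
X_padded = np.hstack([X_signal, X_noise])

r2_signal_only = r_squared(X_signal, y)
r2_with_noise = r_squared(X_padded, y)
print("R^2 with 1 real feature:", r2_signal_only)
print("R^2 after adding 20 noise features:", r2_with_noise)
```

Because the padded model contains the original feature, its training error cannot be larger, so the training R squared never goes down even though the extra columns carry no information.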

#### Gaussian distribution and probabilities

- For a data set that is random in nature, we assume that it follows a Gaussian distribution.
- Any Gaussian distribution is defined by its mean, µ, and variance, σ², and is represented by $X \sim \mathcal{N}(\mu, \sigma^2)$, where X is the input matrix.
- For any point $x_i$, the probability (density) of $x_i$ is given by the expression

  $$p(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
- Also, because the occurrence of each $x_i$ is independent of the others, their joint probability is given by

  $$p(x_1, \dots, x_N) = \prod_{i=1}^{N} p(x_i)$$
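
As a minimal sketch of these two ideas, the Gaussian density of a single point and the joint probability of independent points can be computed as below. The function names and sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2, evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def joint_probability(xs, mu, sigma2):
    """Joint density of independent points: the product of the individual densities."""
    return np.prod(gaussian_pdf(np.asarray(xs, dtype=float), mu, sigma2))

def log_joint_probability(xs, mu, sigma2):
    """Log of the joint density: the sum of the individual log densities."""
    return np.sum(np.log(gaussian_pdf(np.asarray(xs, dtype=float), mu, sigma2)))

xs = np.array([-0.5, 0.0, 1.2])  # illustrative sample
joint = joint_probability(xs, mu=0.0, sigma2=1.0)
log_joint = log_joint_probability(xs, mu=0.0, sigma2=1.0)
print("joint density:", joint)
print("log joint density:", log_joint)
```

Working with the log of the joint probability turns the product into a sum, which is why the log is taken before differentiating in the next section.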

#### Likelihood function

- Linear regression is the solution that gives the maximum likelihood for the line of best fit.
- What is likelihood? We define likelihood as the probability of the data X given a parameter of interest, which in our case is µ. So we define the likelihood function as

  $$L = p(X \mid \mu) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
- Linear regression maximizes this function to find the line of best fit. We find the value of µ that maximizes it, and can then say that our data very likely came from a population with mean µ.
- To solve this, we first take the natural log of the likelihood function (L), then differentiate ln L with respect to µ and set the result to zero, which gives

  $$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Hence, this value of µ maximizes the likelihood function.
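
A quick numerical check of this result, under the assumption of a Gaussian likelihood with a known, fixed variance: evaluate the log-likelihood on a grid of candidate values of µ and see where it peaks. The function name, the grid, and the synthetic sample are all assumptions of this sketch.

```python
import numpy as np

def gaussian_log_likelihood(xs, mu, sigma2):
    """Log-likelihood of independent Gaussian samples xs for a given mean mu."""
    xs = np.asarray(xs, dtype=float)
    n = xs.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - np.sum((xs - mu) ** 2) / (2.0 * sigma2)

rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=2.0, size=1000)  # illustrative data
sample_mean = samples.mean()

# Grid of candidate means centred on the sample mean (spacing 0.01).
candidates = np.linspace(sample_mean - 1.0, sample_mean + 1.0, 201)
log_liks = np.array([gaussian_log_likelihood(samples, mu, sigma2=4.0) for mu in candidates])
best_mu = candidates[np.argmax(log_liks)]

print("sample mean:", sample_mean)
print("grid value with highest log-likelihood:", best_mu)
```

On this grid, the candidate with the highest log-likelihood coincides with the sample mean, which is what setting the derivative to zero predicts.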

#### Maximizing likelihood and minimizing the error function

- One thing to note here is that maximizing the likelihood function L is equivalent to minimizing the error function E. Also, y is Gaussian distributed with mean $w^T x$ and variance $\sigma^2$, or

  $$y \sim \mathcal{N}(w^T x, \sigma^2)$$

  or

  $$y = w^T x + \varepsilon$$

  where ε is Gaussian-distributed noise with zero mean and variance σ².
- This is equivalent to saying that in linear regression, the errors are Gaussian and the trend is linear.
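
The equivalence can be illustrated numerically. In the sketch below, which assumes Gaussian noise with a fixed variance, the negative log-likelihood is a constant plus the sum of squared errors divided by 2σ², so the weights that minimize one also minimize the other. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def sum_squared_error(X, y, w):
    """Sum of squared errors E for the linear model y_hat = X.w."""
    residual = y - X.dot(w)
    return residual.dot(residual)

def gaussian_neg_log_likelihood(X, y, w, sigma2):
    """-ln L for y ~ N(X.w, sigma2): a constant plus E / (2 * sigma2)."""
    n = len(y)
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + sum_squared_error(X, y, w) / (2.0 * sigma2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, 0.5, -0.5])
y = X.dot(true_w) + rng.normal(scale=0.5, size=100)

# Least-squares weights minimize E; any other w should do worse on both measures.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
w_other = w_ls + 0.1

for name, w in [("least squares", w_ls), ("perturbed", w_other)]:
    print(name, "E =", sum_squared_error(X, y, w),
          "-ln L =", gaussian_neg_log_likelihood(X, y, w, sigma2=0.25))
```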

#### Why do we need regularization?

- Now, let us understand why regularization was introduced. One answer is outliers! In the presence of outliers, linear regression finds a line of best fit that deviates from the real trend. This is because it follows the method of least squares: to minimize the error, it bends the trend line towards the outliers. This makes the predictions less accurate than they would be in the absence of outliers. To handle this problem, we introduce regularization.

#### The concept of Penalty

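- L1 regularization uses the L1 norm of the weights as a penalty term, so the cost function becomes

  $$J = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \lVert w \rVert_1$$

  where $\lVert w \rVert_1 = \sum_j |w_j|$ and λ controls the strength of the penalty.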
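
As a minimal sketch, a cost with an L1 penalty can be written as the sum of squared errors plus a coefficient times the sum of the absolute weights. The function name `lasso_cost` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def lasso_cost(X, y, w, l1_coeff):
    """Sum of squared errors plus l1_coeff times the L1 norm of w."""
    residual = y - X.dot(w)
    return residual.dot(residual) + l1_coeff * np.sum(np.abs(w))

# Toy example that can be checked by hand.
X_toy = np.eye(2)
y_toy = np.array([1.0, 2.0])
w_toy = np.array([1.0, -1.0])

# Residual is [0, 3], so the squared error is 9; the L1 norm of w is 2, so the penalty is 5 * 2 = 10.
cost = lasso_cost(X_toy, y_toy, w_toy, l1_coeff=5.0)
print(cost)  # 19.0
```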

#### Likelihood and Prior probabilities

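- Plain squared error maximizes the likelihood, as shown above. But now, since we have two terms in the cost function, we no longer do just this. We now have two probabilities: one is the likelihood and the other is the prior. The following formula gives the likelihood:

  $$P(Y \mid X, w) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right)$$

  and the formula for the prior, which for L1 regularization is a Laplace distribution on each weight, is:

  $$P(w) = \prod_{j=1}^{D} \frac{\lambda}{2} \exp\left(-\lambda |w_j|\right)$$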

- We call P(w) the prior because it represents our prior beliefs about w. Thus, J is now proportional to -ln(P(Y|X, w)) - ln(P(w)). Also, by Bayes' rule, P(w|Y,X) is proportional to P(Y|X,w)*P(w). We call P(w|Y,X) the posterior probability. The method of maximizing P(w|Y,X) is called maximum a posteriori (MAP) estimation.
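
To see where the two terms of the cost function come from, here is a sketch of the negative log posterior. It assumes Gaussian noise with variance σ² and an independent Laplace prior with rate λ on each of the D weights; both are assumptions of this sketch, and terms that do not depend on w are collected into a constant.

$$-\ln P(Y \mid X, w) - \ln P(w) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - w^T x_i)^2 + \lambda \sum_{j=1}^{D} |w_j| + \text{const}$$

Multiplying through by 2σ² gives the sum of squared errors plus an L1 penalty with coefficient 2σ²λ. Under these assumptions, maximizing the posterior is therefore the same as minimizing a Lasso cost.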

### Implementing Lasso Regression using Python

Now, let us see how to implement L1 regularization, or Lasso regression, using gradient descent (I will cover gradient descent in a separate post).

### Importing libraries

In [1]:

```
from __future__ import print_function, division
from builtins import range
import numpy as np # importing numpy with alias np
import matplotlib.pyplot as plt # importing matplotlib.pyplot with alias plt
```

### Defining number of observations and dimensions

In [2]:

```
No_of_observations = 50
No_of_Dimensions = 50
X_input = (np.random.random((No_of_observations, No_of_Dimensions))-0.5)*10 # Generating a 50x50 matrix for X with uniform random values in [-5, 5), centered around 0
w_dash = np.array([1, 0.5, -0.5] + [0]*(No_of_Dimensions-3)) # Making first 3 features significant by setting w for them as non-zero and others zero
Y_output = X_input.dot(w_dash) + np.random.randn(No_of_observations)*0.5 #Setting Y = X.w + some random noise
```

### Setting initial weights, L1 coefficient and learning rate

In [3]:

```
costs = [] #Setting empty list for costs
w = np.random.randn(No_of_Dimensions)/np.sqrt(No_of_Dimensions) #Setting w to random values
L1_coeff = 5
learning_rate = 0.001 #Setting learning rate to small value so that the gradient descent algo doesn't skip the minima
```

In [4]:

```
for i in range(500):
    Yhat = X_input.dot(w) # predicted output for the current w
    delta = Yhat - Y_output # the error between predicted output and actual output
    w = w - learning_rate*(X_input.T.dot(delta) + L1_coeff*np.sign(w)) # performing gradient descent for w
    meanSquareError = delta.dot(delta)/No_of_observations # finding mean square error
    costs.append(meanSquareError) # appending mse for each iteration in costs list
```
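
The `L1_coeff*np.sign(w)` term in the update is a subgradient of the L1 penalty, since the derivative of |w| is sign(w) away from zero. The sketch below wraps the same update in a reusable function and runs it on a tiny problem that can be checked by hand. The function name, the `w_init` parameter, and the demo data are hypothetical and not part of the original notebook.

```python
import numpy as np

def lasso_subgradient_descent(X, y, l1_coeff, learning_rate, n_iter, w_init=None):
    """Run the same update as the loop above and return final weights and MSE history."""
    n_samples, n_features = X.shape
    # Start from zeros unless an initial weight vector is supplied (an assumption of this sketch).
    w = np.zeros(n_features) if w_init is None else np.asarray(w_init, dtype=float).copy()
    costs = []
    for _ in range(n_iter):
        delta = X.dot(w) - y  # prediction error for the current weights
        # Gradient of the squared-error term plus subgradient of the L1 penalty.
        w = w - learning_rate * (X.T.dot(delta) + l1_coeff * np.sign(w))
        costs.append(delta.dot(delta) / n_samples)  # mean squared error before the update
    return w, costs

# Tiny demo: with X = identity, each weight is fitted independently to its own target.
X_demo = np.eye(3)
y_demo = np.array([2.0, 0.0, -1.0])
w_demo, costs_demo = lasso_subgradient_descent(
    X_demo, y_demo, l1_coeff=0.1, learning_rate=0.1, n_iter=500
)
print(w_demo)  # approximately [1.9, 0.0, -0.9]
```

In this demo the non-zero targets are shrunk towards zero by the L1 coefficient (2 becomes 1.9, -1 becomes -0.9), and the weight whose target is zero stays at zero. With a random start, as in the notebook, weights for irrelevant features generally end up very small rather than exactly zero.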

### Plotting costs for Lasso Regression using Python

In [5]:

```
plt.plot(costs)
plt.title("Plot of costs of L1 Regularization")
plt.ylabel("Costs")
plt.show()
```

### Printing weights

In [6]:

```
print("final w:", w) # The final w output. The first 3 weights should be significant and the rest very small
```
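
Rather than reading the printed vector by eye, one option is to split the weights by magnitude. This is only a sketch: the helper `split_by_magnitude`, the tolerance of 0.05, and the values in `w_example` are all made up for illustration and are not output from the notebook.

```python
import numpy as np

def split_by_magnitude(w, tol=0.05):
    """Return indices of weights with |w| >= tol and of those with |w| < tol."""
    w = np.asarray(w, dtype=float)
    significant = np.flatnonzero(np.abs(w) >= tol)
    negligible = np.flatnonzero(np.abs(w) < tol)
    return significant, negligible

# Hypothetical weight vector: three clearly non-zero entries, the rest near zero.
w_example = np.array([0.98, 0.47, -0.51, 0.003, -0.008, 0.0])
significant_idx, negligible_idx = split_by_magnitude(w_example)
print("significant:", significant_idx)  # [0 1 2]
print("negligible:", negligible_idx)    # [3 4 5]
```

Applied to the notebook's `w`, a call such as `split_by_magnitude(w)` would list which features the model treats as important, with the tolerance chosen to suit the scale of the data.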

### Plotting weights

In [7]:

```
# plot our w vs true w
plt.plot(w_dash, label='true w')
plt.plot(w, label='w_map')
plt.legend()
plt.show()
```