Avoiding overfitting is one of the most important aspects of training a machine learning model. An overfit model scores well on its training data but achieves low accuracy on new, unseen data. This happens when the model tries too hard to capture the noise in the training dataset. By noise, we mean data points that don't represent the true properties of your data, but are instead there by random chance.
Start with the simple linear regression relationship Y ≈ β0 + β1X1 + β2X2 + … + βpXp. Here Y denotes the learned relation, the X's are the variables or predictors, and the β's are the coefficient estimates attached to them.
The fitting procedure uses a loss function known as the residual sum of squares (RSS): RSS = Σᵢ (yᵢ − β0 − β1xᵢ1 − … − βpxᵢp)². The coefficients are chosen in such a way that this loss function is minimized.
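As a concrete illustration, here is a minimal sketch of ordinary least squares in NumPy, fitting coefficients that minimize the RSS. The data is synthetic and the variable names are chosen for this example; they are not from any particular library's API beyond NumPy itself.

```python
import numpy as np

# Toy data for illustration (synthetic, assumed for this sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 samples, 3 predictors
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=100)  # noisy target

# Ordinary least squares: choose coefficients that minimize the RSS
X_design = np.column_stack([np.ones(len(X)), X])     # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
rss = np.sum(residuals ** 2)                         # residual sum of squares
print(beta_hat, rss)
```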
Now, the coefficients are adjusted based on your training data. If the training data contains noise, the estimated coefficients will not generalize well to subsequent data. This is where regularization enters the picture, shrinking (or regularizing) the learned estimates toward zero. Ridge regression does this by minimizing the penalized loss RSS + λ Σ βⱼ², where the tuning parameter λ controls how strongly large coefficients are penalized.
The lasso is a variant of this idea. The main difference between it and ridge regression is the form of the penalty: instead of the squares of the coefficients β, it employs their absolute values (the modulus), minimizing RSS + λ Σ |βⱼ|. This penalty is referred to as the L1 norm.
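The following sketch contrasts the two penalties using scikit-learn, which is not part of the original discussion; the data is synthetic, and scikit-learn's `alpha` parameter plays the role of the tuning parameter λ (its Lasso objective scales the RSS term by 1/(2n), but the shrinkage behaviour is the same).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.normal(size=100)

# alpha ~ lambda in the penalized objectives:
#   ridge: RSS + alpha * sum(beta_j^2)   (squared L2 penalty)
#   lasso: RSS + alpha * sum(|beta_j|)   (L1 penalty)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)   # shrunk toward zero, rarely exactly zero
print("Lasso:", lasso.coef_)   # some coefficients driven exactly to zero
```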
Consider a problem with two parameters, β1 and β2. The ridge constraint can then be written as β1² + β2² ≤ s, which describes a disk of radius √s centered at the origin. Among all points inside this disk, the ridge regression coefficients are the ones with the smallest loss function.
Similarly, the lasso constraint becomes |β1| + |β2| ≤ s, which describes a diamond with its corners on the axes. Among all points inside this diamond, the lasso coefficients are the ones with the smallest loss function.
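To make the geometry concrete, this small matplotlib sketch (an illustration added here, not from the original) draws the two constraint regions for s = 1. The diamond's corners sit on the axes, which is why the lasso solution often lands exactly on an axis, i.e. with one coefficient exactly zero.

```python
import numpy as np
import matplotlib.pyplot as plt

s = 1.0
theta = np.linspace(0, 2 * np.pi, 200)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))

# Ridge constraint: beta1^2 + beta2^2 <= s (a disk of radius sqrt(s))
axes[0].fill(np.sqrt(s) * np.cos(theta), np.sqrt(s) * np.sin(theta), alpha=0.3)
axes[0].set_title(r"Ridge: $\beta_1^2 + \beta_2^2 \leq s$")

# Lasso constraint: |beta1| + |beta2| <= s (a diamond with corners on the axes)
diamond = np.array([[s, 0], [0, s], [-s, 0], [0, -s], [s, 0]])
axes[1].fill(diamond[:, 0], diamond[:, 1], alpha=0.3)
axes[1].set_title(r"Lasso: $|\beta_1| + |\beta_2| \leq s$")

for ax in axes:
    ax.set_xlabel(r"$\beta_1$")
    ax.set_ylabel(r"$\beta_2$")
    ax.axhline(0, lw=0.5)
    ax.axvline(0, lw=0.5)
    ax.set_aspect("equal")
plt.tight_layout()
plt.show()
```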
Larger coefficients reflect a more flexible model, and if we wish to minimize the penalized objective above, the coefficients must stay small. This is how ridge regularization prevents coefficients from growing too high.
Choosing a good value for the tuning parameter λ is critical, and this is where cross-validation comes in handy. The ridge penalty, the sum of the squared coefficients, is the squared L2 norm of the coefficient vector, which is why ridge is often called L2 regularization.
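A minimal sketch of this selection step, assuming scikit-learn's built-in cross-validated estimators and synthetic data; the candidate grids of penalty strengths are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.normal(size=200)

# Search a grid of candidate penalty strengths with 5-fold cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5).fit(X, y)

print("Best ridge alpha:", ridge_cv.alpha_)
print("Best lasso alpha:", lasso_cv.alpha_)
```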
The coefficients produced by the traditional least-squares approach are scale equivariant: when we multiply a predictor by a constant c, the corresponding coefficient is scaled by a factor of 1/c, so the product of predictor and coefficient remains the same regardless of how the predictor is scaled. Ridge regression does not share this property, because the penalty treats all coefficients equally, so it is best to standardize the predictors before fitting.
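The sketch below demonstrates this on synthetic data (an added illustration, not from the original article): rescaling a predictor leaves the OLS fit effectively unchanged, while ridge reacts strongly unless the predictors are standardized first, here via scikit-learn's StandardScaler in a pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

X_scaled = X.copy()
X_scaled[:, 0] /= 1000.0  # re-express a predictor in larger units

# OLS is scale equivariant: the first coefficient simply grows 1000x,
# so the product of predictor and coefficient is unchanged
print(LinearRegression().fit(X, y).coef_)
print(LinearRegression().fit(X_scaled, y).coef_)

# Ridge is NOT scale equivariant: the penalty treats all coefficients
# alike, so the now-huge coefficient gets shrunk drastically.
# Standardizing the predictors first avoids this.
print(Ridge(alpha=1.0).fit(X_scaled, y).coef_)
ridge_std = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_scaled, y)
print(ridge_std.named_steps["ridge"].coef_)
```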
A clear disadvantage of ridge regression is model interpretability. It will shrink the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. In other words, the final model will always include all of the predictors.
In the case of the lasso, the L1 penalty has the effect of driving some of the coefficient estimates to be exactly zero when the tuning parameter is large enough. The lasso therefore performs variable selection as well as producing sparse models.
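A short sketch of this sparsity effect on synthetic data (again assuming scikit-learn): as the penalty strength grows, the lasso zeroes out more coefficients, leaving a model that uses only the strongest predictors.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([4.0, 0.0, 2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

# Stronger penalties zero out more coefficients, i.e. select fewer variables
for alpha in [0.01, 0.1, 1.0]:
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    kept = np.count_nonzero(coefs)
    print(f"alpha={alpha}: {kept} nonzero coefficients -> {np.round(coefs, 2)}")
```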
Regularization is a handy technique for improving the accuracy of your regression models.