What is regularization in machine learning?
Avoiding overfitting is one of the most important parts of training a Machine Learning model. An overfit model performs well on the training data but achieves low accuracy on new data. Overfitting occurs when the model tries too hard to capture the noise in your training dataset. By noise, we mean data points that do not represent the true properties of your data but are instead the result of random chance.
- Regularization is a form of regularized regression in Machine Learning in which the coefficient estimates are constrained, regularized, or shrunk towards zero. In other words, this strategy discourages learning an overly complicated or flexible model in order to avoid overfitting.
Consider a simple linear regression relationship: Y ≈ β0 + β1X1 + β2X2 + … + βpXp. Here Y is the learned relation (the response), X1, …, Xp are the variables or predictors, and β0, …, βp are their coefficient estimates.
The fitting procedure uses a loss function known as the residual sum of squares (RSS): the sum of the squared differences between the observed responses and the values predicted by the model. The coefficients are chosen in such a way that this loss function is minimized.
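As a rough illustration, here is a minimal sketch of the RSS computation in Python; the function and variable names (rss, beta0, beta) are illustrative rather than taken from any particular library.

```python
import numpy as np

def rss(X, y, beta0, beta):
    """Residual sum of squares: squared differences between the observed
    responses y and the predictions beta0 + X @ beta, summed up."""
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2)
```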
Now, the coefficients are adjusted based on your training data. If the training data contains noise, the estimated coefficients will not generalize well to future data. This is where regularization enters the picture: it shrinks, or regularizes, these learned estimates towards zero.
Lasso and Ridge Regression
The lasso also penalizes large coefficients; the main difference between this variant and ridge regression is that it employs the modulus (absolute value) of β rather than the square of β as the penalty. This penalty is referred to as the L1 norm.
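The difference between the two penalties is easy to see on a small example. The sketch below computes both for an illustrative coefficient vector; the variable names are assumptions for this example only.

```python
import numpy as np

beta = np.array([0.5, -1.2, 3.0])   # illustrative coefficient vector
l1_penalty = np.sum(np.abs(beta))   # lasso penalty (L1 norm): 0.5 + 1.2 + 3.0 = 4.7
l2_penalty = np.sum(beta ** 2)      # ridge penalty (squared L2 norm): 0.25 + 1.44 + 9.0 = 10.69
```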
- Ridge regression is similar to solving an equation in which the sum of the squares of the coefficients is constrained to be less than or equal to s, while the lasso solves an equation in which the sum of the moduli of the coefficients is less than or equal to s. For each value of the shrinkage factor there exists a corresponding constant s. These equations are also known as constraint functions.
Consider a problem with two parameters. The ridge constraint can then be written as β1² + β2² ≤ s. This means that the ridge regression coefficients have the smallest loss function among all points lying within the circle defined by β1² + β2² ≤ s.
Similarly, the lasso constraint becomes |β1| + |β2| ≤ s. This means that the lasso coefficients have the smallest loss function among all points lying within the diamond defined by |β1| + |β2| ≤ s.
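To make the geometry concrete, the short sketch below checks whether a candidate coefficient pair lies inside each constraint region; the numbers are made up for illustration. Note that the same pair can satisfy the ridge constraint while violating the lasso constraint, which is why the two methods select different coefficients.

```python
s = 1.0                                         # shrinkage budget (illustrative)
b1, b2 = 0.6, -0.5                              # candidate coefficient pair

inside_ridge_circle = b1**2 + b2**2 <= s        # 0.36 + 0.25 = 0.61 <= 1.0 -> True
inside_lasso_diamond = abs(b1) + abs(b2) <= s   # 0.6 + 0.5 = 1.1 > 1.0 -> False
```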
- In ridge regression, a shrinkage penalty is added to the RSS loss function, and the coefficients are estimated by minimizing this combined function. The penalty is scaled by a tuning parameter that determines how much we want to penalize our model’s flexibility.
Larger coefficients reflect greater flexibility, so if we wish to minimize the function above, the coefficients must stay small. This is how ridge regularization in Machine Learning prevents coefficients from growing too large.
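Here is a minimal sketch of this penalized objective, reusing the illustrative names from the RSS sketch above (lam stands for the tuning parameter and is also an assumption of the example):

```python
import numpy as np

def ridge_objective(X, y, beta0, beta, lam):
    rss = np.sum((y - (beta0 + X @ beta)) ** 2)   # residual sum of squares
    penalty = lam * np.sum(beta ** 2)             # shrinkage term scaled by the tuning parameter lam
    return rss + penalty                          # the intercept beta0 is conventionally left unpenalized
```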
When the tuning parameter is set to 0, the penalty term has no effect, and the ridge regression estimates are equivalent to least squares. However, as the tuning parameter grows large, the shrinkage penalty becomes more significant, and the ridge regression coefficient estimates approach zero. As can be seen, choosing a proper value of the tuning parameter is crucial.
This is where cross-validation comes in handy. The penalty used by ridge regression, based on the sum of the squared coefficients, is known as the L2 norm.
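As a sketch of how cross-validation can pick the tuning parameter in practice, assuming scikit-learn is available (the data below is synthetic and the candidate alphas are arbitrary):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

# RidgeCV evaluates each candidate alpha (the tuning parameter) by cross-validation.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
print(model.alpha_)   # the alpha selected by cross-validation
print(model.coef_)    # shrunken coefficients: small, but not exactly zero, for weak predictors
```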
The coefficients produced by the traditional least-squares approach are scale equivariant: if we multiply an input by c, the corresponding coefficient is scaled by a factor of 1/c. As a result, the product of predictor and coefficient remains the same regardless of how the predictor is scaled.
- Nevertheless, this is not the case with ridge regression, so we must standardize the predictors, bringing them to the same scale, before running ridge regression.
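One common way to do this, assuming scikit-learn, is to standardize inside a pipeline so the scaling learned on the training data is reused at prediction time; X_train and y_train below are placeholders for your own data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# StandardScaler puts every predictor on the same scale, so the ridge
# penalty treats all of them comparably.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# model.fit(X_train, y_train)   # placeholders for your training data
```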
A clear disadvantage of ridge regression is model interpretability. It shrinks the coefficients of the least important predictors very close to zero, but they will never be exactly zero. In other words, all predictors are included in the final model.
In the case of the lasso, when the tuning parameter is large enough, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero. As a result, the lasso performs variable selection as well as producing sparse models.
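A small sketch of this behaviour, assuming scikit-learn, with synthetic data in which only three of six predictors are informative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_beta = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 2.0])   # three informative predictors
y = X @ true_beta + rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)   # the uninformative predictors' coefficients come out exactly 0.0
```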
- Regularization in Machine Learning greatly reduces the model’s variance without substantially increasing its bias. The tuning parameter governs this trade-off between bias and variance in the regularization techniques discussed above. As the value of the tuning parameter increases, the coefficients shrink, lowering the variance. Up to a point this is purely beneficial, since it only reduces variance (thus avoiding overfitting) without losing any significant characteristics of the data. Beyond a certain value, however, the model begins to lose crucial properties, which introduces bias and leads to underfitting. The value of the tuning parameter should therefore be chosen carefully.
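The shrinking effect of the tuning parameter can be seen directly by sweeping it over a range of values; the sketch below assumes scikit-learn and synthetic data, and the chosen alphas are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([5.0, -3.0, 2.0, 0.5]) + rng.normal(size=100)

for alpha in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))   # estimates move towards zero as alpha increases
```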
Regularization is a handy strategy for increasing the accuracy of your regression models.