In machine learning, regularization refers to a collection of strategies that help a model genuinely learn rather than merely memorize.
If a trained model produces correct results on the training data but performs poorly on unseen data or a test dataset, it is memorizing rather than generalizing.
Imagine you’re building a cat vs. dog classifier and your trained model achieves 98 percent accuracy on the training data but only 82 percent on the test data: it is memorizing rather than generalizing.
Let’s look at a real-life example. Suppose an e-commerce company wants to build a model that predicts whether a user will buy a product based on their activity over the previous seven days, and then use those predictions to make better decisions about retargeted digital ads. The user history might include the number of pages viewed, total time spent, the number of searches conducted, average time per session, page revisits, and similar signals.
They build a model that produces correct results on the data it was trained on, but when they try to make predictions on unseen data, it fails miserably. The model has memorized rather than learned.
So what exactly is going on in the cases above? The most likely explanation is that the model is overfitting, which results in poor performance on data it has never seen. Instead of learning, the model is memorizing.
- In summary, if your model’s evaluation metrics on the training and testing datasets diverge significantly, it has an overfitting problem, as the sketch below illustrates.
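A minimal sketch of this check in Python, assuming scikit-learn and a synthetic dataset; the decision tree, the generated data, and the 10-point accuracy gap used as a threshold are illustrative choices, not a prescription:

```python
# Minimal sketch: flag possible overfitting by comparing train vs. test accuracy.
# The dataset, model, and the 10-point gap threshold are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)  # deep, unconstrained trees memorize easily
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")

if train_acc - test_acc > 0.10:  # a large gap suggests the model is memorizing
    print("Possible overfitting: training and testing metrics diverge significantly.")
```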
Regularization
Regularization refers to a range of strategies that constrain how much a model learns from individual features in classical algorithms, or from individual neurons in neural network algorithms.
It constrains and shrinks the weights associated with a feature or a neuron so that the algorithm does not rely on a small number of features or neurons to predict the outcome. This helps avoid the problem of overfitting.
- In effect, regularization penalizes complexity, nudging a complicated model toward a simpler one that is less prone to overfitting, as shown in the sketch below.
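One way to picture this is a loss function with a weight penalty added to the data-fit term. A minimal NumPy sketch, where the penalty strength `lam` and the synthetic data are illustrative assumptions:

```python
# Minimal sketch of a regularized loss: mean squared error plus an L2 penalty
# on the weights. The penalty strength `lam` is an illustrative assumption.
import numpy as np

def regularized_mse(w, X, y, lam=0.1):
    predictions = X @ w
    mse = np.mean((y - predictions) ** 2)   # data-fit term
    penalty = lam * np.sum(w ** 2)          # discourages any single weight from growing large
    return mse + penalty

# Example call on synthetic data; larger `lam` shrinks the weights more
# aggressively, trading a little bias for lower variance on unseen data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, 0.0, 2.0, 0.0, -1.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)
print(regularized_mse(w_true, X, y, lam=0.1))
```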
Regularization Algorithms
- Ridge regression – Its purpose is to address problems such as overfitting and multicollinearity in the data. When there is considerable collinearity (near-linear relationships among the independent variables) among the feature variables, an ordinary linear or polynomial regression model performs poorly. Ridge regression adds a penalty proportional to the squared magnitude of the coefficients (an L2 penalty). This penalty shrinks the feature coefficients, introducing a little bias into the model but considerably lowering its variance.
Ridge is an excellent way to prevent overfitting.
Use regularization to combat overfitting and to simplify feature selection when your dataset has a large number of features and you want to keep the model from becoming too complicated.
However, ridge has one major drawback: the final model retains all N features, because the coefficients are shrunk but never set exactly to zero.
When two variables are highly correlated, ridge regression shrinks their coefficients towards each other, whereas lasso tends to pick one of them and discard the other.
Which variable gets chosen can depend on the particulars of the data. Elastic-net is a hybrid of the two that shrinks coefficients while still performing sparse selection, as the sketch below illustrates.
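A minimal scikit-learn sketch comparing ridge and elastic-net on synthetic data; the `alpha` and `l1_ratio` values are illustrative assumptions rather than tuned settings:

```python
# Minimal sketch: Ridge keeps every feature but shrinks its coefficient, while
# ElasticNet mixes L1 and L2 penalties. `alpha` and `l1_ratio` are illustrative
# assumptions, not tuned values.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

print("ridge coefficients:", ridge.coef_)       # all N features stay non-zero
print("elastic-net coefficients:", enet.coef_)  # some may be driven to zero
```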
- LASSO – In contrast to ridge regression, it penalizes the absolute value of the coefficients (an L1 penalty). When the regularization hyperparameter is large enough, lasso drives some coefficient estimates exactly to zero. As a result, lasso performs variable selection, producing models that are significantly easier to interpret than ridge regression models. In a nutshell, it reduces variance and improves the accuracy of linear regression models.
If we have a large number of features, LASSO works effectively for feature selection.
It shrinks coefficients to zero, and if a set of predictors is highly correlated, lasso selects one of them and drives the others to zero (see the sketch below).
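A minimal scikit-learn sketch of lasso’s feature selection on synthetic data; the `alpha` value and the generated dataset are illustrative assumptions:

```python
# Minimal sketch: Lasso drives some coefficients exactly to zero, which acts as
# feature selection. `alpha` and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("features with non-zero coefficients:", selected)  # typically only the informative ones survive
```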