Decision trees are a non-parametric supervised learning method for both classification and regression tasks, and overfitting on the training dataset is a common problem. Given the architecture of the model, if a tree is allowed to grow to its full depth, it will almost always overfit the training data. Fortunately, there are various techniques available to prevent overfitting in decision trees. The following are some of the most commonly used:
Decision tree models are usually allowed to grow to their maximum depth. As discussed above, this usually causes the model to overfit, which is undesirable. To prevent this we use pruning, which refers to removing parts of a tree so that it does not grow to its full depth. This is usually achieved by tuning hyperparameters to optimal values. There are two types of pruning used in decision trees:
Pre-pruning: This technique refers to an early-stopping mechanism: training is halted before the tree reaches its full depth, which prevents the model from overfitting. It involves tuning hyperparameters such as the maximum depth, the minimum number of samples per leaf, and the minimum number of samples required to split a node. The scikit-learn decision tree estimators expose these as built-in arguments (max_depth, min_samples_leaf, min_samples_split), which can be adjusted easily in experiments to achieve optimal results.
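A minimal scikit-learn sketch of pre-pruning, using the bundled breast-cancer toy dataset; the hyperparameter values here are chosen only for illustration, not as recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruned tree: growth is stopped early via hyperparameters
pruned = DecisionTreeClassifier(
    max_depth=4,           # cap the depth of the tree
    min_samples_split=10,  # a node needs at least 10 samples to split
    min_samples_leaf=5,    # every leaf keeps at least 5 samples
    random_state=42,
).fit(X_train, y_train)

# Unrestricted tree for comparison
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("full tree depth:", full.get_depth())
print("pruned tree depth:", pruned.get_depth())
print("pruned test accuracy:", pruned.score(X_test, y_test))
```

In practice these values would be chosen via cross-validation (e.g. GridSearchCV) rather than set by hand.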
Post-pruning: This technique allows the decision tree to grow to its full depth during training and then removes branches to prevent the model from overfitting. Cost Complexity Pruning (CCP) is one of the most prominent post-pruning techniques. The ccp_alpha parameter controls the pruning process: as the value of ccp_alpha increases, more nodes are pruned from the tree. The process continues until we find the optimal value, beyond which accuracy on the holdout dataset starts to drop significantly.
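A sketch of cost complexity pruning with scikit-learn, again on the breast-cancer toy dataset: `cost_complexity_pruning_path` enumerates the candidate alphas, one tree is fitted per alpha, and the holdout set picks the winner. The candidate-selection strategy is a simple illustration, not the only way to choose ccp_alpha:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Effective alphas at which subtrees would be pruned away
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

# One tree per candidate alpha; larger alpha means heavier pruning
trees = [
    DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_train, y_train)
    for a in path.ccp_alphas
]

# Keep the alpha that scores best on the holdout set
best = max(trees, key=lambda t: t.score(X_test, y_test))
print("best ccp_alpha:", best.ccp_alpha)
print("nodes:", trees[0].tree_.node_count, "->", best.tree_.node_count)
```

Note how the node count shrinks as ccp_alpha grows; the last tree in the list (largest alpha) is pruned down to very few nodes.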
Ensemble – Random Forest:
Random Forest is an ensemble of tree-based models used for both classification and regression. It trains multiple decision trees on bootstrapped samples of the data and aggregates their predictions (bagging), which reduces overfitting.
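A brief sketch comparing a single unpruned tree with a random forest in scikit-learn; n_estimators and max_features are illustrative defaults, not tuned values:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Single fully grown tree (prone to overfitting)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Forest of bootstrapped trees; predictions are aggregated by majority vote
forest = RandomForestClassifier(
    n_estimators=200,     # number of bootstrapped trees
    max_features="sqrt",  # random feature subset considered at each split
    random_state=42,
).fit(X_train, y_train)

print("single tree test accuracy:", tree.score(X_test, y_test))
print("random forest test accuracy:", forest.score(X_test, y_test))
```

Each tree sees a different bootstrap sample and a random subset of features per split, so the individual trees' errors are only weakly correlated and averaging them lowers the variance of the ensemble.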