
How to avoid overfitting in a decision tree?

Anton Knight
Answered

Decision trees are a non-parametric supervised learning method for both classification and regression tasks, and overfitting the training data is a common problem. Because the algorithm can keep splitting until every leaf is pure, a tree that is allowed to train to its full strength will almost always overfit the training data. Fortunately, several techniques are available to prevent overfitting in decision trees. The following are some of the most commonly used:

Pruning

Decision tree models are usually allowed to grow to their maximum depth, which, as discussed above, tends to cause overfitting. To prevent this, we use pruning: removing parts of a tree so that it does not grow to its full depth, usually by tuning hyperparameters to optimal values. There are two types of pruning used in decision trees:

  – Pre-Pruning

This technique refers to an early stopping mechanism: we do not allow the training process to run to completion, which prevents the model from overfitting. It involves tuning hyperparameters such as maximum depth, minimum samples per leaf, and minimum samples per split so that growth stops early. Scikit-learn's decision tree estimators expose these as built-in arguments (`max_depth`, `min_samples_leaf`, `min_samples_split`), so they can be fine-tuned and changed easily in experiments to achieve optimal results.
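As a minimal sketch of pre-pruning with scikit-learn (the dataset here is synthetic, chosen only for illustration), the same classifier is fit once unconstrained and once with early-stopping hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows to its maximum depth and typically overfits.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: growth stops early via depth and sample-count limits.
pruned = DecisionTreeClassifier(
    max_depth=4,           # cap the depth of the tree
    min_samples_split=10,  # a node needs at least 10 samples to split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=0,
).fit(X_train, y_train)

print(full.get_depth(), pruned.get_depth())
```

The specific values (4, 10, 5) are illustrative; in practice they would be chosen by cross-validated search.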

  – Post-Pruning

This technique allows the decision tree to grow to its full depth during training and then removes branches to prevent the model from overfitting. Cost Complexity Pruning (CCP) is one of the most prominent post-pruning techniques. The `ccp_alpha` parameter controls the process: as its value increases, more nodes are pruned from the tree. The process continues until we reach the value beyond which accuracy on a holdout dataset starts to drop significantly.
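A sketch of CCP-based post-pruning in scikit-learn (synthetic data again, purely for illustration): compute the pruning path, fit one tree per candidate alpha, and keep the one that scores best on a holdout split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Effective alphas along the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

# One tree per alpha; larger ccp_alpha prunes away more nodes.
trees = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    for a in path.ccp_alphas
]

# Keep the tree with the best holdout accuracy.
best = max(trees, key=lambda t: t.score(X_val, y_val))
print(best.ccp_alpha, best.get_depth())
```

Plotting holdout accuracy against `ccp_alphas` makes the "nosedive" point visible, which is how the optimal alpha is usually picked.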

Ensemble – Random Forest:

Random forest is an ensemble method built from tree-based models and used for both classification and regression. It reduces overfitting through bootstrapping: each tree is trained on a random sample of the data (drawn with replacement), and the trees' predictions are aggregated, so no single overfit tree dominates the result.
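A minimal sketch with scikit-learn's `RandomForestClassifier` (synthetic data, illustrative settings): each tree is fit on a bootstrap sample, and classification is a majority vote across trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample of the training data;
# predictions are aggregated by majority vote across the ensemble.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

print(round(forest.score(X_test, y_test), 3))
```

Because the averaging happens across many decorrelated trees, the ensemble generalizes better than any single fully grown tree would.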
