When You Shouldn’t Use Ensemble Learning


In his 1785 essay, the philosopher Marquis de Condorcet formulated what is now known as "Condorcet's jury theorem". The idea is as follows: if each member of a jury has even a slightly better chance of voting correctly than random choice, then majority voting with a large jury will result in high accuracy. This can be viewed as the inspiration for ensemble learning.


Ensemble learning involves training multiple ML models whose predictions are then combined to generate the final prediction. This process typically reduces variance and yields more robust models. More often than not, winners of Kaggle competitions have used ensemble methods, and the winner of the Netflix Prize in 2009 used an ensemble of models as well. However, there are situations in which using ensembles may not be the correct choice, and while they are a popular choice for ML competitions, they are not used in production in the "real world" quite as often as we might expect. Here's why.
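The combination step can be as simple as majority voting over class predictions. Here is a minimal pure-Python sketch of a hard-voting combiner; the three "model" outputs below are made-up values chosen so that each individual model errs on a different sample:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several models by majority vote.

    predictions: a list of per-model prediction lists, all the same length.
    Ties are broken by whichever label was counted first.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = [model_preds[i] for model_preds in predictions]
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three weak models, each wrong on a different sample:
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 1, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # [1, 1, 1, 1]
```

This is the jury theorem in miniature: no single model is right everywhere, but the majority is, provided the models' errors are not all in the same place.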


Note: While there is a wide variety of methods that fall under the category of ensemble learning, in this post we will focus on the practices of stacking and blending multiple fully trained models, rather than methods such as boosting (used in AdaBoost and gradient boosting, for example) and bagging (used in random forests), which are not always perceived as ensembles of models.


Significant Overhead

Perhaps the first case that jumps to mind for when ensembles would not be the way to go is when you can't afford their extra overhead in training and inference time, or the large memory footprint of such models. The math is pretty simple: an ensemble of 10 models requires roughly 10 times the resources of a single model.
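As a back-of-the-envelope illustration of the memory side of that math (the parameter count and float32 storage below are illustrative assumptions, not measurements of any particular model):

```python
def model_size_bytes(n_params, bytes_per_param=4):
    """Rough memory footprint of a model's weights (float32 by default)."""
    return n_params * bytes_per_param

# A hypothetical 5M-parameter model vs. a 10-model ensemble of it:
single = model_size_bytes(5_000_000)
ensemble = 10 * single
print(single // 2**20, "MiB vs", ensemble // 2**20, "MiB")  # 19 MiB vs 190 MiB
```

On a server, 190 MiB is nothing; on a phone or an embedded device, it can be the difference between shipping and not shipping.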


Thus, if your model must operate in real-time (stream processing rather than batch processing), it is essential that your model be lean in order to reduce latency. Additionally, if you would like to deploy your model on-device, the enlarged memory footprint of ensembles can have significant drawbacks. Finally, the additional training time can be especially costly in cases where you need to regularly retrain your model to avoid model drift.

Fitting a large ensemble in an on-device setting is not practical (source)


It is worth noting that using knowledge distillation we can compress the ensemble into a single lean model that imitates it, retaining much of the stability and robustness of the ensemble while gaining the lower latency and smaller memory footprint of a single model.
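The core trick in distillation (Hinton et al., 2015, listed under Further Reading) is training the student on the teacher's temperature-softened output distribution rather than on hard labels. A minimal sketch of that softening step, with made-up teacher logits:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax with temperature T; higher T spreads probability mass,
    exposing the teacher's 'dark knowledge' about non-top classes."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits, e.g. averaged over the ensemble's members:
teacher_logits = [4.0, 1.0, 0.2]
hard_targets = softmax_with_temperature(teacher_logits, T=1.0)
soft_targets = softmax_with_temperature(teacher_logits, T=4.0)
# The student is trained with cross-entropy against soft_targets,
# which carry more information per example than one-hot labels.
```

At T=1 the teacher's distribution is nearly one-hot; at higher temperatures the relative probabilities of the wrong classes become visible, and it is this extra signal that lets a single student approach the ensemble's accuracy.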

Explainable AI (XAI)

In an age where AI systems are becoming more and more central in our everyday lives, it is important that we be able to rely on these systems. Of course, if our model has high accuracy we are likely to put more trust in it, but if we can understand what goes on in the model's "head" and follow the logic behind each prediction, we will feel more confident about using our model to make real-world decisions, and we can even monitor the prediction process. That is the idea behind explainability in AI.


What comprises a sufficient “explanation”? There are various possible definitions, but generally, we expect some sort of decomposition of the input, such that we can focus on the contributions of the features that have the largest impact on the prediction. For instance, in a sentiment analysis task for movie reviews, we would like to highlight the words that have a very positive or negative sentiment.

Explainable sentiment analysis using SHAP (source)
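For a linear model, such a decomposition is exact and trivial to compute: each feature's contribution relative to a baseline input is its weight times its deviation from that baseline (methods like SHAP generalize this idea to arbitrary models). A toy sketch with made-up weights and inputs:

```python
def linear_contributions(weights, x, baseline):
    """For a linear model f(x) = sum(w_i * x_i), each feature's contribution
    to f(x) - f(baseline) is exactly w_i * (x_i - baseline_i)."""
    return [w * (xi - bi) for w, xi, bi in zip(weights, x, baseline)]

weights  = [2.0, -1.0, 0.5]   # illustrative model weights
x        = [1.0,  3.0, 0.0]   # the input we want to explain
baseline = [0.0,  0.0, 0.0]   # reference input

contribs = linear_contributions(weights, x, baseline)
print(contribs)  # [2.0, -3.0, 0.0]
# The contributions sum to f(x) - f(baseline), so the "explanation"
# fully accounts for the prediction.
```

For a single model this bookkeeping is clean; for an ensemble, every member produces its own decomposition, and merging them into one faithful explanation is exactly the engineering headache described below.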

Ensembles of models are inherently less explainable, just as a jury's decision emerges from a collection of different arguments that sway each juror independently. While it is theoretically possible to aggregate explanations of the different models' predictions into a "general explanation" for the final prediction, this is not a simple task from an engineering point of view.

Thus, if it is important for your business to be able to produce explanations for your model’s prediction, we recommend sticking to a single model, rather than using an ensemble.


Before paying the prices of training and using ensemble methods, it is important to verify that you are actually leveraging the power of ensembles, and not simply training redundant models.

Ensembles of Ensembles?

A widely used regularization technique for neural networks is dropout. With this method, during training we randomly "disable" some of the neurons so that the network builds in redundancy and does not rely too heavily on any single neuron. Later, during inference, we use the entire network to make predictions.

This method in itself can be viewed as a type of ensemble, and thus creating an ensemble of NNs that were trained with dropout may be redundant.
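The "dropout as an ensemble" view can be made concrete: each stochastic forward pass samples a different subnetwork, and averaging many such passes approximates what the full network computes at inference time. A pure-Python sketch of this for a single linear unit, with made-up weights (real frameworks implement the same inverted-dropout scaling):

```python
import random

def predict_with_dropout(x, weights, p=0.5, rng=random):
    """One stochastic forward pass: each weight is dropped with probability p,
    and survivors are scaled by 1/(1-p) (inverted dropout)."""
    kept = [w if rng.random() > p else 0.0 for w in weights]
    scale = 1.0 / (1.0 - p)
    return scale * sum(w * xi for w, xi in zip(kept, x))

random.seed(0)
x = [1.0, 2.0, 3.0]
weights = [0.5, -0.2, 0.1]   # illustrative values

# Average over many sampled subnetworks -- an implicit ensemble:
samples = [predict_with_dropout(x, weights) for _ in range(1000)]
mc_estimate = sum(samples) / len(samples)

# Deterministic inference with the full network:
full = sum(w * xi for w, xi in zip(weights, x))
# mc_estimate is close to full: the averaging is already built in.
```

Since one dropout-trained network already averages over an exponential number of subnetworks, stacking several of them into an explicit ensemble often buys little additional diversity.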

An Ensemble of One

Ensembles are especially powerful when the models they are made up of have different areas of expertise. For example, training different models to be experts at resolving different types of confusion in a classification task can yield a much more accurate model when these experts are merged into an ensemble than a single generalist model would. However, when we train multiple similar models on the same data, the resulting ensemble may perform no better than any single model.

A jury of duplicates of a single person does not provide any benefits (source)
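A quick sanity check before paying for an ensemble is to measure how often candidate members agree; near-identical prediction patterns mean you are training duplicates. A simple sketch with made-up model outputs:

```python
def pairwise_agreement(preds_a, preds_b):
    """Fraction of samples on which two models make the same prediction."""
    same = sum(a == b for a, b in zip(preds_a, preds_b))
    return same / len(preds_a)

model_a = [1, 0, 1, 1, 0, 1]
model_b = [1, 0, 1, 1, 0, 1]   # a near-duplicate of model_a
model_c = [1, 1, 0, 1, 0, 0]   # errs on different samples

print(pairwise_agreement(model_a, model_b))  # 1.0 -> redundant pair
print(pairwise_agreement(model_a, model_c))  # 0.5 -> diverse pair
```

If every pair agrees almost everywhere, the "jury" is one person cloned ten times, and the ensemble's extra cost buys nothing; disagreement that is concentrated on different samples is what voting can actually exploit.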


Ensemble methods can be useful for reducing variance and building more robust models. However, there are significant downsides to using them, such as reduced explainability and substantially higher computational cost. Additionally, it is worth remembering that ensembles draw their power from aggregating diverse models that focus on different aspects of the problem.


Finally, if you find that you achieve better results with an ensemble than with any single model, try distilling the ensemble's knowledge into a single model to get comparable accuracy with lower cost and better explainability.


Further Reading

Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2015). “Distilling the knowledge in a neural network“. arXiv
Hinton slides on “Dark Knowledge”
Analysis of dropout learning regarded as ensemble learning
KDNuggets: Ensemble learning
Why Use Ensemble Learning
When should I not use an ensemble classifier 
Ensemble Learning
