How to Test Machine Learning Models


  • Adversarial attacks: Testing models can help detect possible adversarial attacks. Rather than letting such attacks happen in a production environment, a model can be tested with adversarial examples to increase its robustness before deployment.
  • Data integrity and bias: Data collected from most sources is usually unstructured and might reflect human bias that can be modeled during training. This bias might be against a particular group by gender, race, religion, or sexuality, with varying consequences in society depending on the scale of use. During evaluation, bias can be missed because evaluation focuses mostly on performance and not on the behavior of the model given the role of the data.
  • Spot failure modes: Failure modes can occur when deploying ML systems into production. These can be due to performance bias failures, robustness failures, or model input failures. Some of these failures can be missed by evaluation metrics, even though they signal problems. A model with an accuracy of 90% is struggling to generalize on the remaining 10% of the data. That can prompt you to inspect the data and look for errors, giving you better insight into how to fix it. Evaluation is not all-encompassing, so structured tests for the scenarios that may be encountered need to be established to help detect failure modes.

This article demonstrates how testing in machine learning differs from testing “normal” software and why evaluating model performance is not enough. You will learn how to test machine learning models and which principles and best practices you should follow.

Problems with Testing Machine Learning Models

Software developers write code to produce deterministic behavior. Testing identifies explicitly which part of the code fails and provides a relatively coherent coverage measure (e.g., lines of code covered). It helps us in two ways:

  • Quality assurance: checking whether the software works according to requirements, and
  • Identifying defects and flaws during development and in production.

Data scientists and ML engineers train models by feeding them examples and setting parameters. The model’s training logic produces the behavior. This process poses these challenges when testing ML models:

  • Lack of transparency. Many models work like black boxes.
  • Indeterminate modeling outcomes. Many models rely on stochastic algorithms and do not produce the same model after (re)training.
  • Generalizability. Models need to work consistently in circumstances other than their training environment.
  • Unclear idea of coverage. There is no established way to express testing coverage for machine learning models. “Coverage” does not refer to lines of code in machine learning as it does in software development. Instead, it might relate to ideas like input data and model output distribution.
  • Resource needs. Continuous testing of ML models is resource- and time-intensive.

These issues make it difficult to understand the reasons behind a model’s low performance, interpret the results, and assure that our model will work even when there is a change in the input data distribution (data drift) or in the relationship between our input and output variables (concept drift).

Evaluation vs. Testing

Many practitioners may rely solely on machine learning model performance evaluation metrics. Evaluation, however, is not the same as testing. It is important to know the difference.

ML model evaluation focuses on the overall performance of the model. Such evaluations may consist of performance metrics and curves, and perhaps examples of incorrect predictions. This model evaluation is a great way to monitor your model’s outcome between different versions. Remember that it does not tell us a lot about the reasons behind the failures and the specific model behaviors.

For example, your model might suffer a performance drop in a critical data subset while its overall performance doesn’t change or even improves. In another case, model retraining on new data may not change performance but could introduce unnoticed social biases against a specific demographic group.

Machine learning tests, on the other hand, go beyond evaluating the models’ performance on subsets of data. They ensure that the composite parts of the ML system work together effectively to achieve the desired level of quality. You could say that they help teams point out flaws in the code, data, and model so they can be fixed.


Principles & Best Practices

Testing is not easy, and testing Machine Learning models is even harder. You need to prepare your workflow for unexpected events while working with dynamic inputs, black-box models, and shifting input/output relationships.

For this reason, it is worth following these established best practices in software testing:

  • Test after introducing a new component, model, or data, and after model retraining.
  • Test before deployment and production.
  • Write tests to avoid recognized bugs in the future.

Testing ML models has additional requirements. You also need to follow testing principles specific to the ML problem:


Robustness

Robustness requires your model to produce relatively stable performance even under radical real-time changes in the data and its relationships.

You can strengthen robustness in the following ways:

  • Have a Machine Learning procedure that your team follows.
  • Explicitly test for robustness (e.g., drift, noise, bias).
  • Have a monitoring policy for deployed models.


Interpretability

Maintaining interpretability helps you understand specific aspects of your model:

  • Whether the model predicts outputs as it should (e.g., based on human evaluators).
  • How input variables contribute to the output.
  • Whether the data/model has underlying biases.


Reproducibility

Model changes occur due to parameter adjustments, retraining, or new data. To scale the model in production, no matter the platform it runs on, you need to ensure that your results are reproducible.

Reproducibility has many aspects. Although this is not an article on the subject, here are some tips to ensure that your model is reproducible:

  • Use a fixed random seed with a deterministic random number generator.
  • Make sure that the components run in the same order and receive the same random seed.
  • Use version control even for preliminary iterations.
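The seeding tips above can be sketched in a few lines. This is a minimal illustration using Python's standard library; in a real project you would seed every framework you use (NumPy, PyTorch, etc.) in the same way:

```python
import random

def make_rng(seed: int = 42) -> random.Random:
    """Return a deterministic random number generator with a fixed seed."""
    return random.Random(seed)

# Two runs seeded identically produce identical "training" shuffles.
rng_a = make_rng(42)
rng_b = make_rng(42)

data = list(range(10))
shuffled_a = data[:]
shuffled_b = data[:]
rng_a.shuffle(shuffled_a)
rng_b.shuffle(shuffled_b)

assert shuffled_a == shuffled_b  # identical order across runs
```

Passing the generator object around (rather than relying on the global seed) also helps ensure components receive the same random state in the same order.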

How to Test Machine Learning Models

Many existing ML model testing practices follow manual error analysis (e.g., failure mode classification), making them slow, costly, and error-prone. A proper ML model testing framework should systematize these practices.

You can map software development test types to Machine Learning models by applying their logic on Machine Learning behavior:

  • Unit test. Check the correctness of individual model components.
  • Regression test. Check whether your model breaks and test for previously encountered bugs.
  • Integration test. Check whether the different components work with each other within your machine learning pipeline.
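To make the mapping concrete, here is a hypothetical unit test and regression test for a small preprocessing step in an ML pipeline. The `min_max_scale` function and the bug it guards against are illustrative, not from a real codebase:

```python
def min_max_scale(values):
    """Scale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # guard against division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_unit_scaling_range():
    # Unit test: outputs must stay within [0, 1].
    scaled = min_max_scale([3, 7, 11])
    assert min(scaled) == 0.0 and max(scaled) == 1.0

def test_regression_constant_input():
    # Regression test: a previously encountered bug (division by zero
    # on constant input) must not reappear.
    assert min_max_scale([5, 5, 5]) == [0.0, 0.0, 0.0]

test_unit_scaling_range()
test_regression_constant_input()
```

An integration test would then exercise this step together with the model, e.g., training on scaled data end to end.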

Specific testing tasks can belong to different categories (model evaluation, monitoring, validation) depending on your specific problem case, circumstance, and organization structure. This article focuses on tests specific to the Machine Learning modeling problem (post-train tests), so we do not cover other test types. Make sure that you integrate your machine learning model tests into your wider Machine Learning model monitoring framework.

Testing Trained Models

For code, you can write manual test cases. This is not a great option for Machine Learning models as you cannot cover all edge cases in a multi-dimensional input space.

Instead, test model performance through monitoring, data slicing, or property-based testing targeted at real-world problems.

You can combine this with test types that examine specifically the internal behavior of your trained models (post-train tests):

  • Invariance test
  • Directional expectation test
  • Minimum functionality test

We will discuss each type below. If you are interested in an overview of approaches to Machine Learning model testing, check out this post.

Invariance Test

The invariance test defines input changes that are expected to leave model outputs unaffected.

The common method for testing invariance is related to data augmentation. You pair up modified and unmodified input examples and measure how much the modification affects the model output.

One example is to check whether a person’s name affects a prediction of their height. Our default assumption can be that there should be no relationship between the two. A test failing on this assumption might imply a hidden demographic connection between name and height (for example, because our data covers multiple countries with different names and height distributions).
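The name/height example above can be sketched as a simple paired test. `predict_height` is a toy stand-in for a real trained model; the test perturbs only the name and asserts the output is unchanged:

```python
def predict_height(record: dict) -> float:
    # Toy stand-in model: the prediction depends only on age, not the name.
    return 100.0 + 2.5 * min(record["age"], 30)

def test_invariance_to_name():
    original = {"name": "Alice", "age": 25}
    modified = {**original, "name": "Yuki"}   # perturb only the name
    # The two predictions should be (near) identical.
    assert abs(predict_height(original) - predict_height(modified)) < 1e-6

test_invariance_to_name()
```

With a real model you would run this over many name/record pairs and allow a small tolerance rather than exact equality.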

Directional Expectation Test

You can run directional expectation tests to define how changes in the input distribution are expected to affect the output.

A typical example is testing assumptions about the number of bathrooms or property size when predicting house prices. A higher number of bathrooms should mean a higher price prediction. Seeing a different result might reveal wrong assumptions about the relationship between our input and output or the distribution of our dataset (e.g., small studio apartments are overrepresented in expensive neighborhoods).
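The bathroom example can be expressed as a directional test. `predict_price` below is an illustrative stand-in model; the test increases only the bathroom count and asserts the predicted price does not fall:

```python
def predict_price(features: dict) -> float:
    # Toy linear stand-in: price grows with size and bathroom count.
    return 50_000 + 1_000 * features["sqm"] + 15_000 * features["bathrooms"]

def test_more_bathrooms_not_cheaper():
    base = {"sqm": 80, "bathrooms": 1}
    upgraded = {**base, "bathrooms": 2}       # increase only the bathrooms
    assert predict_price(upgraded) >= predict_price(base)

test_more_bathrooms_not_cheaper()
```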

Minimum Functionality Test

The minimum functionality test helps you decide whether individual model components behave as you expect. The reasoning behind these tests is that overall, output-based performance can conceal critical emerging issues in your model.

Here are ways to test individual components:

  • Create samples that are “very easy” for the model to predict, to see whether it consistently delivers the expected predictions.
  • Test data segments and subsets that meet specific criteria (e.g., run your language model only on short sentences of your data to see its ability to “predict short sentences”).
  • Test for failure modes you have identified during manual error analysis.

Test Model Skills

Software development tests often focus on the project’s code. However, this does not always work with ML workflows, as code is not the only element and behavior does not map so clearly to pieces of code.

A more ‘behavioral’ way to organize a machine learning test is to focus on the “skills” we expect from the model (as suggested by this paper about testing NLP models). For example, we can check whether our natural language model picks up information about vocabulary, names, and arguments. From a time series model, we should expect it to recognize trends, seasonalities, and change points.

You can test these skills programmatically by checking for the above discussed model properties (i.e., invariance, directional expectation, minimum functionality).

Test Performance

Testing a full model takes a lot of time, especially if you do integration tests.

To save on resources and speed up testing, test small components of the model (e.g., check whether a single iteration of gradient descent leads to a decrease in loss) or use just a small amount of data. You can also use simpler models to detect shifts in feature importance and catch concept drift and data drift in advance.
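The gradient descent example mentioned above can be tested cheaply in isolation. This is a minimal sketch with a one-parameter linear model and hand-written gradients; all names are illustrative:

```python
def mse(w, xs, ys):
    """Mean squared error of the linear model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gd_step(w, xs, ys, lr=0.01):
    # Gradient of MSE with respect to w: 2 * mean(x * (w*x - y)).
    grad = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # true relation: y = 2x
w0 = 0.0
w1 = gd_step(w0, xs, ys)

assert mse(w1, xs, ys) < mse(w0, xs, ys)    # one step must reduce the loss
```

The same idea scales up: run one optimizer step of your real training loop on a tiny batch and assert the loss decreases, without training the full model.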

For integration tests, have simple tests running continuously with each iteration and keep bigger and slower tests running in the background.


Testing is an iterative process and can be difficult in ML projects that require huge amounts of data and long model training cycles. For small and large teams alike, time is a scarce resource. This means teams have to pick the test frameworks that work best for their unique situation, whether that is validating random data samples in small batches, writing unit tests for specific model behaviors, or a combination of approaches.

Different teams, no matter their size, are encouraged to integrate these practices into their overall ML project lifecycle to improve the quality of their product. Deepchecks is one of the best packages to start with if you want quick data and model validation tests in your project. Great Expectations focuses mostly on data quality, and both Deepchecks and Great Expectations can be integrated with standard testing packages such as pytest, which makes it easier to write and scale tests in ML applications.

So start testing today!

