Do you run a machine learning framework, but you are not sure if it works? Does your model performance decrease, but you fail to see how and why? Do you want to test your machine learning model but do not know where to start?
This article shows you how testing machine learning code differs from testing ‘normal’ software and why your textbook model evaluation routines do not work.
You will learn how to test machine learning models, along with the principles and best practices to follow when you plan your testing framework.
Please note that in this article, we talk about testing as a software development concept. For testing in the machine learning evaluation process, check out this general overview.
Problems With Testing Machine Learning Models
In software development, humans write code to produce deterministic behavior. Testing identifies explicitly which part of the code fails and provides a relatively coherent coverage measure (e.g., lines of code covered). It helps us in two ways:
- Quality assurance: it verifies that the software works according to requirements;
- Defect detection: it identifies defects and flaws during development and in production.
In machine learning, humans train models by feeding them with examples and setting parameters. The model training logic produces the behavior. This difference raises challenges to testing for many reasons:
- Lack of transparency: Models often work like black boxes.
- Indeterminate outcomes: Machine learning model outputs are often nondeterministic (e.g., because of random initialization or sampling), making it harder to check whether they work well.
- Generalizability: Models need to work consistently in circumstances other than their training environment.
- Unclear idea of coverage: There is no established way to express testing coverage for machine learning models.
- Resource need: Continuous testing of machine learning models is resource and time-intensive.
These issues make it difficult to understand the reasons behind a model’s low performance, interpret the results, and ensure that our model will work even when the world out there changes or we have a different upstream process.
Difference Between Machine Learning Model Evaluation and Testing
Many practitioners may rely solely on machine learning model validation, so it can be helpful to identify the differences between model evaluation and testing.
Machine learning model evaluation focuses on the overall performance of the model. A usual machine learning evaluation report consists of performance metrics and curves, some operational statistics, and perhaps examples of incorrect predictions.
This kind of model evaluation is a great way to monitor your model’s outcome between different versions. However, it does not tell us a lot about the reasons for failure and specific model behaviors.
Examples include a performance regression on a critical data subset while overall performance improves, or social bias introduced by new training data.
To avoid such issues, you need to test your models to specify the reasons and mechanisms behind behavior change and track behavioral regression for specific failure modes.
Principles and Best Practices in Machine Learning Model Testing
Testing is not easy, and testing machine learning models is even harder. You need to prepare your workflow for unexpected events while working with changing inputs and nondeterministic outputs.
For this reason, it is worth following established best practices in software testing:
- Test after introducing a new component, model, or data, and after model retraining.
- Test before deployment and production.
- Write tests to avoid recognized bugs in the future.
Adhering to the following additional testing principles may also help to cover areas you would otherwise skip over:
- Robustness;
- Interpretability;
- Reproducibility.
Let’s discuss them in detail.
Robustness means that your model delivers relatively stable performance even when the data and its underlying relationships change radically in real time.
You can strengthen robustness in the following ways:
- Peer review your model to catch mistakes;
- Have a machine learning procedure that your team follows;
- Explicitly test for robustness (e.g., drift, noise, bias);
- Have a monitoring policy for deployed models.
Maintaining interpretability helps you understand specific aspects of your model:
- Whether the model predicts outputs as it should (e.g., based on human evaluators);
- How input variables contribute to the output;
- Whether the data/model has underlying biases.
To understand how your model changes thanks to parameter adjustments, retraining, or new data, especially within a team, you need to make your results reproducible.
Reproducibility has many aspects. Here are some tips:
- Use a fixed random seed with a deterministic random number generator;
- Make sure the components run in the same order and receive the same random seed;
- Use version control even for preliminary iterations.
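The first two tips can be illustrated with a small sketch. It assumes a hypothetical training routine, `train_toy_model`, whose update rule is made up for illustration; the point is only that a dedicated, seeded generator makes two runs comparable.

```python
import random


def train_toy_model(data, seed):
    """Hypothetical training routine: shuffles the data and draws an initial weight.

    A dedicated random.Random instance keeps the run independent of any
    global random state touched elsewhere in the pipeline.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)            # data order depends only on the seed
    weight = rng.uniform(-1.0, 1.0)  # initialization depends only on the seed
    # One pass of a made-up update rule, just to produce a result to compare.
    for x in shuffled:
        weight += 0.01 * (x - weight)
    return weight


def test_training_is_reproducible():
    data = [0.5, 1.5, 2.5, 3.5]
    first = train_toy_model(data, seed=42)
    second = train_toy_model(data, seed=42)
    # Same seed and same order of operations should give the same result.
    assert first == second
```

In a real project, each library that draws random numbers would need the same treatment, which is why the order of components matters as much as the seed itself.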
How to Test Machine Learning Models?
Many existing model testing practices follow manual error analysis (e.g., failure mode classification), making them slow, costly, and error-prone. A proper model testing framework should systematize these practices.
The question is, how?
You can map software development test levels to machine learning models by applying their logic to the behavior of the model:
- Unit test: Check the correctness of individual model components.
- Regression test: Check whether your model breaks or reproduces previously encountered bugs.
- Integration test: Check whether the different components work with each other within your machine learning pipeline.
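As a rough sketch of how these three levels might look in practice, the example below uses a hypothetical rule-based sentiment classifier as a stand-in for a trained model. All names (`tokenize`, `predict_sentiment`, `KNOWN_BUG_CASES`) and the pinned bug cases are assumptions made for illustration.

```python
NEGATIONS = {"not", "never", "no"}
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "awful"}
LABELS = {"positive", "negative", "neutral"}


def tokenize(text):
    """Single component: lowercases the text and splits it on whitespace."""
    return text.lower().split()


def predict_sentiment(text):
    """Hypothetical stand-in for a trained model; returns one of LABELS."""
    score = 0
    negate = False
    for token in tokenize(text):
        if token in NEGATIONS:
            negate = True
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False  # a negation only affects the next token
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Inputs that once triggered a (hypothetical) bug, pinned with the expected label.
KNOWN_BUG_CASES = [
    ("The movie was not good", "negative"),
    ("never bad", "positive"),
]


def test_unit_tokenizer():
    # Unit test: one component in isolation.
    assert tokenize("Not  GOOD") == ["not", "good"]


def test_regression_known_bugs():
    # Regression test: previously encountered bugs must stay fixed.
    for text, expected in KNOWN_BUG_CASES:
        assert predict_sentiment(text) == expected


def test_integration_pipeline_returns_valid_label():
    # Integration test: raw text goes through every component end to end.
    for text in ["", "great", "An awful, awful day"]:
        assert predict_sentiment(text) in LABELS
```

The test functions follow the naming convention that test runners such as pytest collect automatically, so they could sit next to the rest of a project's test suite.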
Because of the overlaps between testing, evaluation, and monitoring, there are testing areas that you can find under other labels.
Examples are checking your data for consistency, integrity, and assumptions, ensuring that your model input and output follow the expected schema, or monitoring data leakage between training and test sets.
This article focuses on tests specific to the machine learning modeling problem (post-train tests), so we do not cover other test types. Make sure that you integrate your machine learning model tests into your wider machine learning model monitoring framework.
Testing Trained Models
For code, you can write manual test cases. This is not a great option for machine learning models as you cannot cover all edge cases in a multi-dimensional input space.
Instead, feed the model randomly generated data and evaluate its performance, or do targeted property-based testing based on your operation domain.
You can combine this with test types that examine specifically the internal behavior of your trained models (post-train tests):
- Invariance test;
- Directional expectation test;
- Minimum functionality test.
We will discuss each type below. If you are interested in an overview of approaches to machine learning model testing, check out this post.
Invariance tests check whether changing model inputs breaks performance.
The common method for testing invariance is related to data augmentation. You pair up modified and unmodified input examples and see how much this affects the model output.
One example is to check whether a person’s name affects the prediction of their health. Our default assumption can be that there should be no relationship between the two. Having a test failing based on this assumption would imply a hidden demographic connection between name and health.
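A minimal sketch of such a paired comparison is shown below. The model `predict_health_risk`, its coefficients, the record fields, and the deliberately flawed `biased_model` are all hypothetical; only the pattern of perturbing one field and comparing outputs is the point.

```python
def predict_health_risk(record):
    """Hypothetical model: risk score in [0, 1] computed from age and BMI only."""
    raw = 0.01 * record["age"] + 0.02 * max(record["bmi"] - 25.0, 0.0)
    return min(raw, 1.0)


def biased_model(record):
    """Deliberately flawed variant whose output leaks the name."""
    bump = 0.1 if record["name"].startswith("A") else 0.0
    return min(predict_health_risk(record) + bump, 1.0)


def invariance_violations(model, records, field, replacements, tolerance=1e-6):
    """Return (record, replacement) pairs where changing `field` moves the output."""
    violations = []
    for record in records:
        baseline = model(record)
        for value in replacements:
            perturbed = {**record, field: value}  # copy with one field changed
            if abs(model(perturbed) - baseline) > tolerance:
                violations.append((record, value))
    return violations


RECORDS = [
    {"name": "Maria", "age": 40, "bmi": 22.0},
    {"name": "John", "age": 65, "bmi": 31.0},
]
NAMES = ["Aiko", "Bob"]


def test_name_does_not_change_prediction():
    assert invariance_violations(predict_health_risk, RECORDS, "name", NAMES) == []
```

Running the same check against `biased_model` returns the pairs where the replacement name shifts the score, which is the signal that would prompt a closer look at the training data.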
Directional Expectation Test
You can run directional expectation tests to examine whether your model reacts well enough to relevant changes and aligns with your assumptions. It is similar to the invariance test but focuses on the model’s ability to pick up appropriate input changes.
A typical example is testing assumptions about the number of bathrooms or property size when predicting house prices. A higher number of bathrooms should mean a higher price prediction. Seeing a different result might reveal wrong assumptions about the relationship between our input and output or the distribution of our dataset (e.g., small houses are overrepresented in rich neighborhoods).
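The bathroom example could be checked along the following lines. This is a sketch under assumptions: `predict_price` and `faulty_model` are hypothetical linear models with invented coefficients, and the helper treats an unchanged prediction as a violation as well.

```python
def predict_price(house):
    """Hypothetical price model with made-up coefficients."""
    return 50_000 + 1_200 * house["size_m2"] + 15_000 * house["bathrooms"]


def faulty_model(house):
    """Deliberately flawed variant: more bathrooms lower the predicted price."""
    return 50_000 + 1_200 * house["size_m2"] - 5_000 * house["bathrooms"]


def directional_violations(model, examples, feature, delta, direction=1):
    """Return examples whose prediction does not move in the expected direction.

    direction=1 expects the output to increase when `feature` grows by `delta`;
    direction=-1 expects it to decrease.
    """
    violations = []
    for example in examples:
        changed = {**example, feature: example[feature] + delta}
        change = model(changed) - model(example)
        if change * direction <= 0:
            violations.append(example)
    return violations


HOUSES = [
    {"size_m2": 60, "bathrooms": 1},
    {"size_m2": 120, "bathrooms": 2},
    {"size_m2": 200, "bathrooms": 3},
]


def test_more_bathrooms_means_higher_price():
    assert directional_violations(predict_price, HOUSES, "bathrooms", 1) == []


def test_larger_size_means_higher_price():
    assert directional_violations(predict_price, HOUSES, "size_m2", 10) == []
```

With `faulty_model`, every house in `HOUSES` is reported as a violation for the bathroom check, which is the kind of result that should lead you to question either the model or the assumption.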
Minimum Functionality Test
The minimum functionality test helps you decide whether individual model components behave as you expect. The reasoning behind these tests is that overall, output-based performance can conceal critical upcoming issues in your model.
Here are ways to test individual components:
- Measure performance for data segments and subsets;
- Identify areas where errors have strong consequences;
- Test for failure modes you identified during manual error analysis.
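To illustrate the first point, here is a small sketch that measures accuracy per data segment and flags segments that fall below a threshold. The field names (`segment`, `score`, `label`), the threshold, and the toy data are assumptions chosen so that a high overall accuracy hides a weak segment.

```python
from collections import defaultdict


def accuracy_by_segment(examples, predict, segment_key):
    """Compute accuracy separately for each value of `segment_key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example in examples:
        segment = example[segment_key]
        total[segment] += 1
        correct[segment] += int(predict(example) == example["label"])
    return {segment: correct[segment] / total[segment] for segment in total}


def failing_segments(examples, predict, segment_key, min_accuracy):
    """Return the sorted segments whose accuracy is below `min_accuracy`."""
    scores = accuracy_by_segment(examples, predict, segment_key)
    return sorted(seg for seg, acc in scores.items() if acc < min_accuracy)


def toy_predict(example):
    """Hypothetical classifier: thresholds a precomputed score."""
    return int(example["score"] > 0.5)


# Segment "A" is large and easy; segment "B" is small and harder.
EXAMPLES = (
    [{"segment": "A", "score": 0.9, "label": 1} for _ in range(4)]
    + [{"segment": "A", "score": 0.1, "label": 0} for _ in range(4)]
    + [
        {"segment": "B", "score": 0.9, "label": 1},
        {"segment": "B", "score": 0.6, "label": 0},  # misclassified
    ]
)
```

On this toy data, nine of ten predictions are correct overall, yet `failing_segments(EXAMPLES, toy_predict, "segment", min_accuracy=0.8)` still reports segment "B", whose accuracy is only 0.5.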
Software development test organization often mirrors the project’s code repository. However, this does not always work with machine learning workflows as code is not the only element, and behavior does not map so clearly to pieces of code.
A more ‘behavioral’ way to organize machine learning tests is to focus on the ‘skills’ we expect from the model. For example, we can check whether our natural language model picks up information about vocabulary, names, and arguments. From a time series model, we should expect it to recognize trends, seasonalities, and change points.
Testing a full model takes lots of time, especially if you do integration tests.
To save on resources and speed up testing, test small components of the model (e.g., a single iteration of gradient descent) or use simple models and just a small amount of data.
For integration tests, have simple tests running continuously with each iteration and keep bigger and slower tests running in the background.
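Testing a single gradient descent iteration might look like the sketch below. It assumes a one-dimensional linear model trained with mean squared error; the function names, data, and learning rate are chosen only for illustration.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w, b, xs, ys, lr):
    """Perform one gradient descent update on (w, b) for the MSE loss."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    return w - lr * grad_w, b - lr * grad_b


# Tiny dataset generated from y = 2x + 1, enough for a fast check.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]


def test_single_step_reduces_loss():
    w, b = 0.0, 0.0
    before = mse_loss(w, b, XS, YS)
    w, b = gradient_step(w, b, XS, YS, lr=0.05)
    assert mse_loss(w, b, XS, YS) < before


def test_step_at_optimum_leaves_parameters_unchanged():
    # At the true parameters every error is zero, so the gradient vanishes.
    assert gradient_step(2.0, 1.0, XS, YS, lr=0.05) == (2.0, 1.0)
```

Checks like these run in milliseconds, so they fit the fast tier that runs with each iteration, while full training runs can stay in the slower background tier.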
Test Your Machine Learning Models
In this article, you learned how testing machine learning applications differs from testing in software development, its main issues, and how it differs from model evaluation. You also learned about different approaches to test your models.
Trying out and implementing different testing methods is not an easy task, especially if you want to integrate them within your overall machine learning monitoring framework. To save on time and resources, you can use frameworks like Deepchecks that automate this process for you while keeping you up to date on the latest research and implementing best practices.
Do you want to learn how? Book a demo, and we will show it to you!