Would you like to be sure that your machine learning system is still performing as expected? Does your model performance decrease, but you fail to see why? Do you want to test your machine learning model but do not know where to start?
This article demonstrates how testing machine learning code differs from testing ‘normal’ software and why evaluating model performance is not enough.
Reading this article, you will learn how to test machine learning models, and what principles and best practices you should follow.
Problems With Testing Machine Learning Models
Software developers write code to produce deterministic behavior. Testing identifies explicitly which part of the code fails and provides a relatively coherent coverage measure (e.g., lines of code covered). It helps us in two ways:
- Quality assurance: it checks whether the software works according to requirements.
- Defect detection: it identifies defects and flaws during development and in production.
Data Scientists and Machine Learning Engineers train models by feeding them with examples and setting parameters. The model training logic produces the behavior. This process raises the following challenges to testing machine learning models:
- Lack of transparency: Many models work like black boxes.
- Indeterminate modeling outcomes: Many models rely on stochastic algorithms and do not produce the same model after (re)training.
- Generalizability: Models need to work consistently in circumstances other than their training environment.
- Unclear idea of coverage: There is no established way to express testing coverage for machine learning models. “Coverage” does not refer to lines of code in machine learning as it does in software development. Instead, it might relate to ideas like input data and model output distribution.
- Resource need: Continuous testing of machine learning models is resource and time-intensive.
These issues make it difficult to understand the reasons behind a model’s low performance, interpret the results, and assure that our model will work even when there is a change in the input data distribution (“data drift”) or in the relationship between our input and output variables (“concept drift”).
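As a rough illustration of how data drift can be checked, the sketch below compares a reference sample of one feature against a current sample using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). The data, the function name `ks_statistic`, and the 0.2 alert threshold are assumptions made for this example, not recommended values.

```python
import bisect

def ks_statistic(reference, current):
    """Largest absolute gap between the empirical CDFs of two samples."""
    ref, cur = sorted(reference), sorted(current)

    def ecdf(sorted_values, x):
        # Share of values that are <= x.
        return bisect.bisect_right(sorted_values, x) / len(sorted_values)

    # The largest gap is always reached at one of the observed values.
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)

# Hypothetical feature values: training time vs. production.
reference = list(range(100))
shifted = [x + 50 for x in reference]

DRIFT_THRESHOLD = 0.2  # assumed alert level, tune for your own data

no_drift = ks_statistic(reference, reference)
drift = ks_statistic(reference, shifted)
print(f"same data: {no_drift:.2f}, shifted data: {drift:.2f}")
print("drift detected:", drift > DRIFT_THRESHOLD)
```

A check like this only looks at one feature at a time. Concept drift, where the input-output relationship changes, needs labels or a proxy for them and is not covered by this sketch.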
Difference Between Machine Learning Model Evaluation and Testing
Many practitioners may rely solely on machine learning model performance evaluation. However, evaluation is not the same as testing. It is important to identify their differences.
Machine learning model evaluation focuses on the overall performance of the model. Such evaluations can consist of performance metrics and curves, and perhaps examples of incorrect predictions.
This way of model evaluation is a great way to monitor your model’s outcome between different versions. However, it does not tell us a lot about the reasons behind the failures and the specific model behaviors.
For example, your model might suffer a performance drop in a critical data subset while its overall performance doesn’t change or even improves. Or, in another case, model retraining on new data does not produce performance change but introduces unnoticed social bias towards a specific demographic group.
To avoid such issues, you need to test your models in order to be able to narrow down the reasons and mechanisms behind behavior change and track behavioral degradation for specific failure modes.
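The following sketch shows why an overall metric can hide such a drop. The labels, the predictions of the two model versions, and the `region` slicing key are all made up for illustration: both versions reach the same overall accuracy, while the new one fails completely on one subset.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy_by_slice(y_true, y_pred, slice_keys):
    """Accuracy computed separately for each value of a slicing key."""
    grouped = defaultdict(lambda: ([], []))
    for t, p, key in zip(y_true, y_pred, slice_keys):
        grouped[key][0].append(t)
        grouped[key][1].append(p)
    return {key: accuracy(ts, ps) for key, (ts, ps) in grouped.items()}

# Hypothetical labels, predictions of two model versions, and a slicing key.
y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
old_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
new_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
region   = ["a", "a", "a", "a", "a", "a", "a", "a", "b", "b"]

print("overall:", accuracy(y_true, old_pred), accuracy(y_true, new_pred))
print("old by slice:", accuracy_by_slice(y_true, old_pred, region))
print("new by slice:", accuracy_by_slice(y_true, new_pred, region))
```

A test could then assert a minimum accuracy per slice instead of a single global threshold.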
Principles and Best Practices in Machine Learning Model Testing
Testing is not easy, and testing machine learning models is even harder. You need to prepare your workflow for unexpected events while working with dynamic inputs, black-box models, and shifting input/output relationships.
For this reason, it is worth following established best practices in software testing:
- Test after introducing a new component, model, or data, and after model retraining
- Test before deployment and production
- Write tests to avoid recognized bugs in the future
However, testing machine learning models has additional requirements. You also need to follow testing principles specific to machine learning problems:
- Robustness
- Interpretability
- Reproducibility
Let’s discuss them in detail.
Robustness means that your model keeps a relatively stable performance even when the data and the relationships within it change radically in real time.
You can strengthen robustness in the following ways:
- Have a machine learning procedure that your team follows.
- Explicitly test for robustness (e.g., drift, noise, bias).
- Have a monitoring policy for deployed models.
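An explicit robustness test for noise could look like the sketch below. It assumes a hypothetical `predict` function standing in for your trained model, and the noise level and tolerance are placeholder values: the test perturbs each input with small Gaussian noise and checks that the prediction does not move by more than the tolerance.

```python
import random

def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [0.5, -0.2, 0.1]
    return sum(w * x for w, x in zip(weights, features))

def max_prediction_shift(predict_fn, samples, noise_std, n_trials=20, seed=0):
    """Largest absolute change in prediction under small Gaussian input noise."""
    rng = random.Random(seed)  # seeded so the test itself is reproducible
    worst = 0.0
    for sample in samples:
        baseline = predict_fn(sample)
        for _ in range(n_trials):
            noisy = [x + rng.gauss(0.0, noise_std) for x in sample]
            worst = max(worst, abs(predict_fn(noisy) - baseline))
    return worst

samples = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
shift = max_prediction_shift(predict, samples, noise_std=0.01)

# Assumed tolerance: predictions may move by at most 0.1 under this noise.
assert shift < 0.1, f"prediction moved by {shift:.3f} under small noise"
```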
Maintaining interpretability helps you understand specific aspects of your model:
- Whether the model predicts outputs as it should (e.g., based on human evaluators)
- How input variables contribute to the output
- Whether the data/model has underlying biases
To understand how your model changes as a result of parameter adjustments, retraining, or new data, especially within a team, you need to make your results reproducible.
Reproducibility has many aspects. Here are some tips:
- Use a fixed random seed with a deterministic random number generator.
- Make sure that the components run in the same order and receive the same random seed.
- Use version control even for preliminary iterations.
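The first two tips can be sketched with Python's standard `random` module. The `train_toy_model` function is a hypothetical stand-in for a training routine with a stochastic step, not a real training loop. Passing an explicitly seeded generator into each component, rather than relying on global state, makes it easier to guarantee that every component receives the same seed on every run.

```python
import random

def train_toy_model(data, seed):
    """Hypothetical 'training' routine whose result depends on data order."""
    rng = random.Random(seed)  # local generator, seeded explicitly
    shuffled = list(data)
    rng.shuffle(shuffled)
    estimate = 0.0
    for x in shuffled:
        # Order-dependent update, so a different shuffle can change the result.
        estimate = 0.9 * estimate + 0.1 * x
    return estimate

data = [1.0, 2.0, 3.0, 4.0, 5.0]
first = train_toy_model(data, seed=42)
second = train_toy_model(data, seed=42)

# Same data and same seed must give exactly the same result.
assert first == second
```

Real frameworks usually have their own seeding functions in addition to Python's, so a seed fixed this way covers only the standard library generator.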
How to Test Machine Learning Models?
Many existing model testing practices follow manual error analysis (e.g., failure mode classification), making them slow, costly, and error-prone. A proper model testing framework should systematize these practices.
The question is, how?
You can map software development test types to machine learning models by applying their logic to machine learning behavior:
- Unit test: Check the correctness of individual model components.
- Regression test: Check whether your model breaks and test for previously encountered bugs.
- Integration test: Check whether the different components work with each other within your machine learning pipeline.
Specific testing tasks can belong to different categories (e.g. model evaluation, monitoring, validation) depending on your specific problem case, circumstance, and organization structure. This article focuses on tests specific to the machine learning modeling problem (post-train tests), so we do not cover other test types. Make sure that you integrate your machine learning model tests into your wider machine learning model monitoring framework.
Testing Trained Models
For code, you can write manual test cases. This is not a great option for machine learning models as you cannot cover all edge cases in a multi-dimensional input space.
Instead, test model performance through monitoring, data slicing, or property-based testing targeted at real-world problems.
You can combine this with test types that specifically examine the internal behavior of your trained models (post-train tests):
- Invariance test
- Directional expectation test
- Minimum functionality test
We will discuss each type below. If you are interested in an overview of approaches to machine learning model testing, check out this post.
Invariance Test
An invariance test defines input changes that are expected to leave model outputs unaffected.
The common method for testing invariance is related to data augmentation. You pair up modified and unmodified input examples and see how much this affects the model output.
One example is to check whether a person’s name affects their predicted height. Our default assumption can be that there should be no relationship between the two. A test that fails under this assumption might imply a hidden demographic connection between name and height (e.g., because our data covers multiple countries with different names and height distributions).
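A minimal sketch of such an invariance test follows. Both prediction functions are hypothetical stand-ins for a trained model: one ignores the name, the other has wrongly picked up on it. The helper pairs each record with copies in which only the name is swapped and reports the cases where the prediction moves by more than a tolerance.

```python
def predict_height_cm(record):
    """Hypothetical stand-in for a trained model; it ignores the name."""
    return 100.0 + 0.4 * record["parent_height_cm"] + 0.1 * record["age"]

def leaky_predict_height_cm(record):
    """Hypothetical model that has wrongly picked up on the name."""
    return predict_height_cm(record) + 0.5 * len(record["name"])

def invariance_violations(predict_fn, records, field, replacements, tol=1e-6):
    """Return (record, replacement) pairs whose prediction moves by more than tol."""
    violations = []
    for record in records:
        baseline = predict_fn(record)
        for value in replacements:
            modified = {**record, field: value}  # change only one field
            if abs(predict_fn(modified) - baseline) > tol:
                violations.append((record, value))
    return violations

records = [
    {"name": "Anna", "age": 30, "parent_height_cm": 170.0},
    {"name": "Bo", "age": 45, "parent_height_cm": 182.0},
]
names = ["Chen", "Oluwaseun", "Maria"]

print(invariance_violations(predict_height_cm, records, "name", names))
print(len(invariance_violations(leaky_predict_height_cm, records, "name", names)))
```

For models with noisy outputs, the tolerance `tol` would need to be looser than the value assumed here.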
Directional Expectation Test
Directional expectation tests define how specific changes in the input are expected to affect the output.
A typical example is testing assumptions about the number of bathrooms or property size when predicting house prices. A higher number of bathrooms should mean a higher price prediction. Seeing a different result might reveal wrong assumptions about the relationship between our input and output or the distribution of our dataset (e.g., small studio apartments are overrepresented in expensive neighborhoods).
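The bathroom example can be sketched as follows. `predict_price` is a hypothetical stand-in for a trained house price model, and its coefficients are invented. The helper increases one feature while holding the others fixed and collects every sample for which the predicted price goes down.

```python
def predict_price(house):
    """Hypothetical stand-in for a trained house price model."""
    return 50_000 + 1_200 * house["size_sqm"] + 8_000 * house["bathrooms"]

def directional_failures(predict_fn, samples, field, delta):
    """Samples whose prediction decreases when `field` is increased by `delta`."""
    failures = []
    for sample in samples:
        changed = {**sample, field: sample[field] + delta}
        if predict_fn(changed) < predict_fn(sample):
            failures.append(sample)
    return failures

houses = [
    {"size_sqm": 45, "bathrooms": 1},
    {"size_sqm": 120, "bathrooms": 2},
]

# Expectation: one more bathroom should never lower the predicted price.
assert directional_failures(predict_price, houses, "bathrooms", 1) == []
assert directional_failures(predict_price, houses, "size_sqm", 10) == []
```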
Minimum Functionality Test
The minimum functionality test helps you decide whether individual model components behave as you expect. The reasoning behind these tests is that overall, output-based performance can conceal critical upcoming issues in your model.
Here are ways to test individual components:
- Create samples that are “very easy” for the model to predict, in order to see if it consistently delivers the expected predictions.
- Test data segments and subsets that meet specific criteria (e.g., run your language model only on short sentences of your data to see its ability to “predict short sentences”).
- Test for failure modes you have identified during manual error analysis.
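The first two ideas can be sketched like this. The keyword-based `predict_sentiment` function is a hypothetical stand-in for a trained classifier, and the test cases are invented. The sketch measures the pass rate on very easy cases and then on the subset of short sentences only.

```python
def predict_sentiment(text):
    """Hypothetical stand-in for a trained sentiment classifier."""
    negative_words = {"bad", "terrible", "awful"}
    words = text.lower().replace(".", "").split()
    return "negative" if any(w in negative_words for w in words) else "positive"

def pass_rate(predict_fn, cases):
    """Share of (text, expected_label) cases the model gets right."""
    return sum(predict_fn(text) == label for text, label in cases) / len(cases)

# "Very easy" cases that any usable model should get right.
easy_cases = [
    ("This is great.", "positive"),
    ("This is terrible.", "negative"),
    ("I love it.", "positive"),
    ("Awful service.", "negative"),
    ("The food was bad and the staff were rude.", "negative"),
]

# Data slice: only sentences with at most three words.
short_cases = [case for case in easy_cases if len(case[0].split()) <= 3]

assert pass_rate(predict_sentiment, easy_cases) == 1.0
assert pass_rate(predict_sentiment, short_cases) == 1.0
```

Requiring a pass rate of exactly 1.0 is an assumption suited to trivially easy cases. For harder slices a lower threshold may be more realistic.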
Test Model Skills
Software development test organization often mirrors the project’s code repository. However, this does not always work with machine learning workflows as code is not the only element, and behavior does not map so clearly to pieces of code.
A more ‘behavioral’ way to organize machine learning tests is to focus on the “skills” we expect from the model (as suggested by this paper about testing NLP models). For example, we can check whether our natural language model picks up information about vocabulary, names, and arguments. From a time series model, we should expect to recognize trends, seasonalities, and change points.
You can test these skills programmatically by checking for the above discussed model properties (i.e., invariance, directional expectation, minimum functionality).
Testing a full model takes a lot of time, especially if you do integration tests.
To save on resources and speed up testing, test small components of the model (e.g., check whether a single iteration of gradient descent leads to loss decrease) or use just a small amount of data. You can also use simpler models to detect shifts in feature importance and catch concept drift and data drift in advance.
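The gradient descent check mentioned above can be sketched on a tiny linear regression problem. The data, the starting weights, and the learning rate are assumptions chosen for the example. The test runs a single update and asserts that the mean squared error goes down.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_step(w, b, xs, ys, lr):
    """One step of gradient descent on the mean squared error."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

# Tiny synthetic dataset generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
loss_before = mse_loss(w, b, xs, ys)
w, b = gradient_step(w, b, xs, ys, lr=0.05)
loss_after = mse_loss(w, b, xs, ys)

# A single step with a small learning rate should reduce the loss.
assert loss_after < loss_before
```

A test like this runs in milliseconds, so it can be part of the fast test suite that runs with every iteration.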
For integration tests, have simple tests running continuously with each iteration and keep bigger and slower tests running in the background.
Test Your Machine Learning Models
In this article, you learned how testing machine learning applications differs from testing in software development, its main issues, and how it differs from model evaluation. You also learned about different approaches to test your models.
Trying out and implementing different testing methods is not an easy task, especially if you want to integrate them within your overall machine learning monitoring framework. To save on time and resources, implement out-of-the-box solutions like Deepchecks.
Deepchecks provides you with an automated testing solution based on best practices and the latest research in the field.
Do you want to learn how? Book a demo, and we will show you!