Machine Learning Testing Principles: Making Sure Your Model Does What You Think It Should Do

Intro

Machine Learning models are becoming increasingly popular: they are widely used software components that play an important role in a variety of industries. However, classic practices for testing and QA of software systems are not a perfect fit for ensuring that our ML models perform the way we believe they should.

In this post, we will cover some of the basics of automated testing for ML models, which can save you the time and effort of debugging a model after it has already caused significant damage. We will focus on tests that can be applied to a model at the post-training stage, but it is important to note that you should also read up on related subjects such as monitoring your model in production and validating your data.

Classic unit testing does not apply to ML models

One reason it can be hard to test ML models is that they are complicated objects. Classic software systems can be decomposed into simple units, where each unit has a well-defined task whose correctness can be verified with a small number of tests. An ML model, on the other hand, is one large unit, the product of a training process, and it cannot be decomposed in the same way.

Another reason machine learning testing differs from classic software testing is that we are testing something that is inherently non-deterministic and probabilistic. Our model can make mistakes on some inputs and still be the best possible model. In classical software, by contrast, there is no tolerance for incorrect output, and so we can test for such errors in a more straightforward manner.

There is therefore a need for a dedicated methodology for QA of ML systems. We will discuss how to test our models to ensure that they do what is required of them, dividing the methods into black-box testing and white-box testing.

Many of the technical terms and ideas are borrowed from the paper Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Ribeiro et al., 2020) and from the resources listed in the Further reading section.

Black Box Testing

General Evaluation

The most widely used “tests” for ML systems are the evaluation metrics (e.g. accuracy, precision, F1, AUC) computed on the test data. These metrics are important, and indeed if we can maintain high results in production, our model is probably doing what it’s meant to do. If, on the other hand, the results deteriorate, then we know something is wrong.

However, these results paint a very high-level picture: they do not tell us what may be causing a problem. Additionally, evaluating on the full test set can be computationally expensive, so we might want to run this test less frequently than other, cheaper tests.
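As a concrete illustration, here is a minimal pytest-style sketch that fails the build when the overall metrics drop below a chosen threshold. The model and test_set fixtures, the predict_proba interface, and the threshold values are assumptions made for illustration.

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def test_global_metrics(model, test_set):
    # `model` and `test_set` are assumed pytest fixtures from your own project.
    y_score = model.predict_proba(test_set.X)[:, 1]    # probability of the positive class
    y_pred = (y_score >= 0.5).astype(int)
    assert accuracy_score(test_set.y, y_pred) >= 0.90  # thresholds are illustrative
    assert f1_score(test_set.y, y_pred) >= 0.85
    assert roc_auc_score(test_set.y, y_score) >= 0.90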

General metrics are necessary but not sufficient for understanding our model’s performance. (source)

 

Manual Error Analysis

In contrast to the high-level picture of overall performance provided by the evaluation metrics, it is important to dive into the details: sample some examples that our model gets wrong and investigate why it predicts the wrong value for them.

After going through multiple examples, try to detect patterns: Are the errors similar to each other? Why is the model making mistakes? Are there issues with the input data? Would a human make a similar mistake? Answering these questions is essential to get a deep understanding of your model’s capabilities.

The confusion matrix can give us an indication of the most common errors (source)
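A minimal sketch of how the most common confusion can be surfaced automatically with scikit-learn; the toy y_true and y_pred arrays stand in for your validation labels and predictions:

import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder labels and predictions; replace with your validation set.
y_true = np.array([0, 1, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 1, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = true classes, columns = predicted classes

# The largest off-diagonal entry points to the most common confusion,
# which is a good candidate for manual inspection.
most_confused = np.unravel_index(np.argmax(cm - np.diag(np.diag(cm))), cm.shape)
print(f"Most common confusion: true={most_confused[0]}, predicted={most_confused[1]}")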

 

Additionally, it can be helpful to cluster the data points and visualize the data (using PCA or t-SNE for example). This can help you understand if your model performs poorly on a certain region of the data.

Visualization of clusters of the MNIST dataset using t-SNE (source)

 

Note that some digits look similar to others; our model is more likely to make mistakes near the border between the “1” cluster and the “7” cluster, for example.
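A minimal sketch of such a visualization with scikit-learn and matplotlib, using the digits dataset as a stand-in for your own data and highlighting the misclassified points:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Small end-to-end example on the scikit-learn digits dataset.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
y_pred = LogisticRegression(max_iter=2000).fit(X_train, y_train).predict(X_val)

# Embed the validation set in 2D and highlight the misclassified points.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_val)
errors = y_pred != y_val
plt.scatter(embedding[~errors, 0], embedding[~errors, 1], s=5, alpha=0.3, label="correct")
plt.scatter(embedding[errors, 0], embedding[errors, 1], s=20, c="red", label="error")
plt.legend()
plt.title("t-SNE of the validation set: where do the errors concentrate?")
plt.show()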

Although this post focuses on automated tests, we recommend running at least one iteration of this manual stage, since familiarity and direct contact with your model and data are essential for creating quality ML models.

Important: Error analysis should be done on a validation set and not on the hold-out set, in order to prevent data leakage.

 

Naive Single Prediction Tests

If you think about it, the most straightforward analogy to unit testing for ML models is to provide a sample that the model must predict correctly and assert that the prediction is indeed correct. For example, if we created a model for sentiment analysis of movie reviews, we would expect the text “One of the best movies of 2010” to be classified as positive. For such a test we should choose an example that is “easy” and unambiguous: one where, if our model gets it wrong, it shouldn’t be in production.

The problem with such tests is, as mentioned earlier, that ML models are allowed to make mistakes. What we usually care about is the global performance, not the prediction for a single example, so a test like this might fail after retraining for no particular reason. That is why we recommend using only “easy” examples that your model really shouldn’t get wrong.

Example code for running with pytest:

def test_negative_sentiment(model):
    # `model` is assumed to be provided by a pytest fixture that loads
    # the trained sentiment classifier (0 = negative, 1 = positive).
    text = "This movie was a total waste of time"
    sentiment = model.predict(text)
    assert sentiment == 0


============================= test session starts =============================
test_sentiment_analysis_model.py::test_negative_sentiment PASSED                                     [100%]
============================== 1 passed in 0.01s ==============================

 

A slightly more sophisticated option is to create templates for simple examples and then evaluate the model on many examples generated from the template. This is referred to as a “Minimum Functionality Test” (Ribeiro et al.).
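A minimal sketch of such a template-based test; the template, the fill-in words, and the label convention (1 = positive) are illustrative assumptions, and `model` is again assumed to be a pytest fixture:

import pytest

POSITIVE_ADJECTIVES = ["great", "wonderful", "excellent", "fantastic"]

@pytest.mark.parametrize("adjective", POSITIVE_ADJECTIVES)
def test_minimum_functionality_positive(model, adjective):
    # Template: "This movie was {adjective}." should always be classified as positive.
    text = f"This movie was {adjective}."
    assert model.predict(text) == 1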

 

Directional Expectation Tests

The next kind of test enables us to define the expected effects of some data perturbations on the output. For example, when trying to predict the value of a house, we would want to assert that our model predicts a higher price for a larger house when other attributes remain constant.

This kind of test is useful because it is comparative rather than absolute: it lets us generate tests from templates automatically without defining the expected prediction value for each example, so we can easily create tests with wide coverage.

For sentiment analysis, we want to ensure that adding positive content makes the prediction more positive, and vice versa for negative content (Ribeiro et al.)
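A minimal sketch of a directional expectation test for the house-price example; the feature layout and the dictionary-based predict interface are assumptions:

def test_larger_house_costs_more(model):
    # Two houses identical in every attribute except square footage.
    base_house = {"sqft": 120, "bedrooms": 3, "year_built": 1995, "distance_to_city": 10}
    larger_house = {**base_house, "sqft": 180}

    # The predicted price should not decrease when only the size grows.
    assert model.predict(larger_house) >= model.predict(base_house)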

Invariance tests

Similarly, we may define data perturbations that should not affect the model output. This resembles data augmentation; however, here we use it as a means of testing our model. For example, in an NLP task, swapping a word with a synonym should not change the output dramatically. Likewise, when creating a “fair” model that is meant to be blind to attributes such as gender or race, we can test that changing these attributes does not affect the prediction in a meaningful way.

Augmented images of a dog should still be classified as “dog” (source)

Example: Swapping the destination should not affect the sentiment prediction (Ribeiro et al.)
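A minimal sketch of an invariance test for the sentiment example; the synonym pairs and the tolerance are assumptions, and predict_proba is assumed here to return the positive-class probability for a single text:

import pytest

SYNONYM_PAIRS = [
    ("The movie was great", "The film was great"),
    ("An amazing performance", "An incredible performance"),
]

@pytest.mark.parametrize("original, perturbed", SYNONYM_PAIRS)
def test_synonym_invariance(model, original, perturbed):
    # Swapping a word for a close synonym should barely move the predicted score.
    p_original = model.predict_proba(original)
    p_perturbed = model.predict_proba(perturbed)
    assert abs(p_original - p_perturbed) < 0.1  # tolerance is an assumption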

Evaluation of data slices

As we’ve discussed earlier, general evaluation gives us a very high-level picture of our results. How can we get a more fine-grained understanding of our results on different types of examples?

The idea is fairly simple – use data slicing to create many different subsets of your dataset, and then evaluate each subset separately. After gathering this information we can then investigate slices with low performance, and understand the underlying causes for non-optimal results. This is a relatively inexpensive process that can help you increase performance significantly.
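A minimal sketch of slice-based evaluation using pandas and scikit-learn; the toy DataFrame and its columns (label, prediction, text_length, country) are illustrative assumptions:

import pandas as pd
from sklearn.metrics import f1_score

# Toy validation frame; in practice this comes from your own pipeline.
df = pd.DataFrame({
    "label":       [1, 0, 1, 1, 0, 1],
    "prediction":  [1, 0, 0, 1, 1, 1],
    "text_length": [12, 45, 8, 60, 15, 33],
    "country":     ["US", "US", "FR", "DE", "US", "FR"],
})

slices = {
    "short_texts": df["text_length"] < 20,
    "long_texts": df["text_length"] >= 20,
    "non_us_users": df["country"] != "US",
}

for name, mask in slices.items():
    subset = df[mask]
    score = f1_score(subset["label"], subset["prediction"])
    print(f"{name}: f1={score:.3f} (n={len(subset)})")
    # Slices with unusually low scores are candidates for deeper error analysis.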

Detecting critical subsets can help you improve your model quality quickly (source)

 

We recommend checking out the snorkel library, which supports customizable data slicing and per-slice evaluation. Additionally, see the paper “Slice Finder: Automated Data Slicing for Model Validation” and the corresponding GitHub repository for automatic detection of slices with poor performance.

Automatic detection of slices with bad performance using “Slice Finder” (source)

White Box Testing

Explainability

Explainable ML systems give us not only the what but also the why. They provide an explanation for their predictions. The definition of an “explanation” might be a question for the philosophy department, but there are some commonly accepted definitions that are used in practice. For example, a decision tree is considered to be self-explanatory since we can understand the process of the prediction by traversing the edges of the tree that correspond to the different conditions.

Decision trees are “self-explanatory” models (source)
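For instance, with scikit-learn you can print the learned rules of a decision tree directly; a minimal sketch on the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The exported rules are a human-readable explanation of every prediction path.
print(export_text(tree, feature_names=list(iris.feature_names)))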

 

In Computer Vision, a common practice is to create a heatmap of the pixels that affect the prediction the most as an “explanation”.

The red pixels have the largest effect on the classification of the image as a “cat”, and therefore provide an explanation for the prediction (source)
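One simple, model-agnostic way to approximate such a heatmap is occlusion sensitivity: slide a gray patch across the image and measure how much the predicted probability of the target class drops. A minimal numpy sketch, where predict_fn is an assumed function mapping a batch of images to class probabilities:

import numpy as np

def occlusion_heatmap(predict_fn, image, target_class, patch=8, stride=8):
    """Heatmap of how much occluding each region lowers the target-class probability."""
    h, w = image.shape[:2]
    baseline = predict_fn(image[None])[0, target_class]
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0.5  # cover the region with a gray patch
            heatmap[i, j] = baseline - predict_fn(occluded[None])[0, target_class]
    return heatmap  # large values = regions the model relies on most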

 

For models that are less explainable (such as DNNs) we can try to use knowledge distillation to provide an equivalent model that is explainable, or we can train our model to output an explanation to go along with the prediction.

Why is this important?

In the context of testing ML models, it is important to understand the inner logic of the model and how different features affect the prediction. This can prepare us for possible future bugs, and help us make more robust ML models.

Examining Weights During Training

Normally, when we see that our model is learning (the loss is decreasing and the accuracy is increasing), we are satisfied. But some useful insights can be gained by examining the weights of the network during training.

Try to detect the bug in the following piece of code for example:

def make_convnet(input_image):
   net = slim.conv2d(input_image, 32, [11, 11], scope="conv1_11x11")
   net = slim.conv2d(input_image, 64, [5, 5], scope="conv2_5x5")
   net = slim.max_pool2d(net, [4, 4], stride=4, scope='pool1')
   net = slim.conv2d(input_image, 64, [5, 5], scope="conv3_5x5")
   net = slim.conv2d(input_image, 128, [3, 3], scope="conv4_3x3")
   net = slim.max_pool2d(net, [2, 2], scope='pool2')
   net = slim.conv2d(input_image, 128, [3, 3], scope="conv5_3x3")
   net = slim.max_pool2d(net, [2, 2], scope='pool3')
   net = slim.conv2d(input_image, 32, [1, 1], scope="conv6_1x1")
   return net

 

Perhaps you noticed that the layers are not really stacked, since we use input_image as the input to each layer. But how would we detect this error? The code would run smoothly, and even the learning process would seem to work. However, we would get poor results since in practice there is only a single convolutional layer in our network.

If we inspected the weights of the network after each epoch, we would see that most of them remain exactly the same.

 

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

def test_convnet():
    image = tf.placeholder(tf.float32, (None, 100, 100, 3))
    # `Model` is assumed to wrap the network above together with a train op.
    model = Model(image)
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    before = sess.run(tf.trainable_variables())
    # Run a single training step on dummy data.
    _ = sess.run(model.train, feed_dict={
        image: np.ones((1, 100, 100, 3)),
    })
    after = sess.run(tf.trainable_variables())
    for b, a in zip(before, after):
        # Make sure something changed.
        assert (b != a).any()

Example test code to make sure that the weights change after a training step

Conclusion

Testing is an essential part of any software development process; however, in the field of ML there is still no widely accepted standard practice. We have shown some practices that let you test your ML models and make sure they do what you expect of them. We hope you enjoyed reading this post, and let us know if you have any thoughts on the subject.

Further reading

How to unit test machine learning code?

DeepXplore: Automated Whitebox Testing of Deep Learning Systems

Ribeiro, M., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4902–4912). Association for Computational Linguistics.

 
