ML Testing: Best Practices and Implementations

This blog post was written by Tonye Harry as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us. We typically pay a symbolic fee for content that's accepted by our reviewers.


Every Machine Learning (ML) team envisions a product that stands the test of time and achieves its goals or key performance indicators. However, model degradation over time, coupled with a dynamic world where both government and corporate policies change constantly, makes this dream hard to achieve.

It is important to regularly test and tweak ML applications to see if they continue to perform at the intended level. To accomplish this, ML teams have to test ML applications at different levels, namely the code, the data, and the model.

These components of an ML application each have their own testing methods. Quality assurance is the primary goal of testing, and its benefits include cost reduction, early detection of errors, optimal performance, and easier maintenance over time.

This article looks at best practices for testing the code and for testing ML pipelines, which includes validating the data and testing the ML model.


An ML system consists of both deterministic and non-deterministic components. A data science/ML practitioner or team should be aware of the different code tasks that make up the deterministic component, especially when building ML pipelines, and test them appropriately. These code tasks are used to:

  1. Ingest and clean raw data
  2. Generate features
  3. Select and train a model
  4. Validate the model
  5. Package the model
  6. Serve the model

Source: edublancas

The basic reason to test the code is to check that it runs as expected. There are different levels of testing, which include the following:

– Unit Testing:

This involves testing the smallest functional components of an application. Each part of the code is expected to work as it should.

Based on this neural network code example:

def train_model(dataset):
    # Code to train a model given a dataset goes here.
    raise NotImplementedError


class NeuralNetwork:
    def __init__(self, n_layers):
        # Attribute name assumed; the original identifier was lost.
        self.name = "NN"
        self.n_layers = n_layers
        self.model = None

    def train(self, dataset):
        self.model = train_model(dataset)

    def compute_accuracy(self, dataset):
        return self.model.compute_acc(dataset)

A very simple unit test might look like this:

import pytest


@pytest.fixture()
def nn5():
    # Provide a fresh 5-layer network to each test that requests it.
    return NeuralNetwork(n_layers=5)


def test_unit_create_nn(nn5):
    assert nn5.name == "NN"
    assert nn5.n_layers == 5
    assert nn5.model is None

Source: Miguel Gonzalez-Fierro

– System Testing:

It tests the fully integrated system to ensure that it produces the expected outputs for given inputs. These tests can be functional or non-functional, such as smoke tests (health checks), security tests, and performance tests.
A simple smoke test for the above example of the neural network might look like this:

def load_dataset(filename):
    # Code to load a dataset from a file goes here.
    raise NotImplementedError


def test_smoke_train():
    nn = NeuralNetwork(n_layers=5)
    dataset = load_dataset('data.csv')
    # Train first; otherwise nn.model would still be None.
    nn.train(dataset)
    assert nn.model is not None

Source: Miguel Gonzalez-Fierro

– Integration Testing:

This involves combining modules logically and testing them as a group. At this level, testing exposes defects in the interfaces and interactions between modules that only appear once they are integrated.
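As a minimal sketch of an integration test, assume two hypothetical pipeline steps, clean_rows and add_bmi_feature (neither comes from the original example). Each could pass its own unit tests, while the test below checks that the output of one is a valid input for the other.

```python
def clean_rows(rows):
    """Drop rows with a missing height or weight (hypothetical cleaning step)."""
    return [r for r in rows if r.get("height_m") and r.get("weight_kg")]


def add_bmi_feature(rows):
    """Add a 'bmi' field to every row (hypothetical feature-generation step)."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]


def test_clean_then_featurize():
    raw = [
        {"height_m": 2.0, "weight_kg": 80.0},
        {"height_m": None, "weight_kg": 70.0},  # must be dropped before featurizing
    ]
    features = add_bmi_feature(clean_rows(raw))
    assert len(features) == 1
    assert features[0]["bmi"] == 20.0
```

If the cleaning step stopped dropping incomplete rows, the feature step would fail on the None height, and this test would surface it even though both functions still work in isolation on clean input.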


– Regression Testing:

This checks whether previously fixed errors reappear after recent code changes. It ensures that these errors are not reintroduced.
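To make this concrete, here is a small sketch built around a made-up bug: assume a hypothetical min_max_scale helper once crashed with a ZeroDivisionError on constant columns. The regression test pins the fixed behavior so that a later refactor cannot quietly bring the bug back.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:  # the fix: guard against dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_regression_constant_column():
    # Hypothetical past bug: a constant column raised ZeroDivisionError.
    assert min_max_scale([3.0, 3.0, 3.0]) == [0.0, 0.0, 0.0]


def test_scaling_still_works():
    # The fix must not change the behavior on ordinary input.
    assert min_max_scale([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]
```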


Best Practices

The coding and testing standards expected of a software engineer also apply to a data science/ML team. If an ML system becomes problematic, no one cares whether it was the software engineers or the ML practitioners who did not test the code. Testing should be a key component of the development process.

Use Tests in Small Fractions:

When writing each unit of code, make sure that it has a single responsibility so that it can easily be tested in isolation. If this is not the case, split the code into atomic units and test each one to ensure that it works properly.

Compose Tests for Each Functionality:

To catch errors early, it is imperative that whenever a new component is created, an accompanying test is created to validate its functionality. This helps ensure code reliability and makes problems easy to trace.

Always Conduct a Regression Test:

With a regression test, a team can catch new errors introduced by each code change and make sure that previously fixed errors are not reintroduced in later versions.

Ensure Maximum Coverage:

Testing 100% of the application might not be possible, but it is the ideal to aim for. Ideally, every line of code is exercised by at least one test. Writing test cases for both the expected logic and unexpected conditions or behaviors helps maximize coverage.
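As an illustration, the sketch below covers both the expected path and the unexpected inputs of a small accuracy helper. The function and the raises_value_error helper are hypothetical and written only for this example. In a Pytest suite, pytest.raises would normally replace the hand-written helper.

```python
def accuracy(y_true, y_pred):
    """Fraction of matching labels; rejects empty or mismatched inputs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def raises_value_error(func, *args):
    """Return True if func(*args) raises ValueError, False otherwise."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


def test_accuracy_expected_path():
    assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75


def test_accuracy_unexpected_inputs():
    # Both error branches are exercised, so no line is left untested.
    assert raises_value_error(accuracy, [], [])
    assert raises_value_error(accuracy, [1, 0], [1])
```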

Source: Pipe drive


Automated testing gives your team rapid feedback and is a less expensive way to detect errors early. It also helps when someone forgets to run a test before committing to a repository. The goal is to run the tests automatically for each commit.

It is very important for the team to log all tests and observations in a file and incorporate the observations into a final test report. Test code should be kept as clean as application code. A popular package for testing application code is Pytest. Every team has its own preference, but the goal is to ensure the quality of the code.


After testing the different tasks or functions that act on the system, the next layer to test is the validity of the data. This is very important when testing ML pipelines because they depend heavily on the data they use. Depending on the ML use case, the data needed for a pipeline might differ. Generally, data is expected to be labeled correctly, cleaned, and structured in a way that lets the model generalize accurately toward the expected outcome.

In data validation testing, we have expectations of the data and these expectations are what we test or validate for quality assurance and to ensure better performance of the model. Great Expectations can be a valuable open-source library to implement expectations for validating data. Problems with the data can stem from faulty labeling, dataset dimension, poor data distribution, dataset properties, etc.

For example, detecting outliers early on saves time and improves the model. Outliers can be introduced during data entry and sampling. Statistical models generalize from the data they are given, and outliers can in some cases reduce the accuracy of the model. At scale, this can cost lives or money.
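One common way to flag such values is the interquartile range (IQR) rule. The sketch below is only an illustration with made-up data. The iqr_outliers helper and the 1.5 multiplier are assumptions for this example, and other detection methods may suit your data better.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up ages; 230 looks like a data-entry error.
ages = [23, 25, 27, 29, 31, 33, 35, 230]
print(iqr_outliers(ages))
```

Flagged values are candidates for review rather than automatic deletion, since a genuine extreme value can carry useful signal.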

Source: Author

Data validation can be implemented in two ways:

  • Manually composing assert or if-statements to track expectations
  • Leveraging the easy-to-use open-source deepchecks package for data validation testing.

Assert Expectations

This method of data validation simply uses assert statements to track expectations of the data. Some teams still use it, but depending on the environment in which an ML pipeline runs, assertions might be disabled, which can lead to serious bugs in the code. If you prefer to write these checks manually, using if-statements to raise exceptions might be a better alternative to assert statements. We do not recommend this method of validation because it is difficult to scale for large ML projects.
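To show the difference, here is a minimal sketch of the if-statement approach. The validate_ages function and the 0 to 120 range are hypothetical. Unlike assert statements, which Python skips when run with the -O flag, these checks always execute.

```python
def validate_ages(ages):
    """Raise ValueError if any age is missing or outside the 0-120 range."""
    for i, age in enumerate(ages):
        if age is None:
            raise ValueError(f"row {i}: age is missing")
        if not 0 <= age <= 120:
            raise ValueError(f"row {i}: age {age} is out of range")
    return ages


validate_ages([34, 58, 21])  # passes silently

try:
    validate_ages([34, -5, 21])
except ValueError as err:
    print(err)  # names the offending row
```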

Deepchecks’ Data Validation Tools

Deepchecks is an open-source package that enables data science practitioners to efficiently validate Machine Learning models and the data used to create them. The importance of this package arises from the inherently intertwined relationship between data and the model in a typical ML development process. It provides users with checks and suites to validate the data or model at each stage of the ML development process.

The package has different data checks that, when used correctly, produce outputs with information about the data. This could be details about data distribution, integrity, or just the structure of the data, depending on what the data science/ML team is looking to validate.

Validation suites, on the other hand, are collections of checks that give a holistic diagnosis of the data of interest. Conditions can be added to these checks for more flexibility. Check out the Quickstart in 5 minutes tutorial.

Considering a computer vision use case, data science teams can use the Deepchecks vision package to check dataset dimensions and label drift, among other things.

Dataset Dimensions: Checking the integrity and correctness of the data

To derive metadata from the dataset of choice, it has to be transformed into a format recognized by Deepchecks. During the preprocessing stage, ImageDimensionsCheck can report, for example, the distributions of the image dimensions and the number of bounding boxes per image. This can help detect outliers in labeling or, in other cases, explain strange behavior in the model of choice.

Source: Author

Label Drift: Checking if the data was split correctly

After splitting data into train and test sets, the distributions of the target label can end up uneven. This mistake can reduce the model performance on the test set. Deepchecks built the TrainTestLabelDrift check to enable data scientists and ML engineers to detect label drift between the distributions of the train and test datasets. It uses metrics like Drift Score – Earth Mover’s Distance, Bounding box area distribution, Drift Score – PSI, Samples per class, and bounding boxes per image.
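To give a feel for what a PSI drift score measures, the following sketch computes the Population Stability Index between two label distributions in plain Python. It is not the Deepchecks implementation. The label_psi function, the eps smoothing value, and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def label_psi(train_labels, test_labels, eps=1e-6):
    """Population Stability Index between two label distributions."""
    classes = set(train_labels) | set(test_labels)
    train_counts, test_counts = Counter(train_labels), Counter(test_labels)
    psi = 0.0
    for c in classes:
        # Share of each class, floored at eps to avoid log(0).
        p = max(train_counts[c] / len(train_labels), eps)
        q = max(test_counts[c] / len(test_labels), eps)
        psi += (p - q) * math.log(p / q)
    return psi


# Same class balance in both splits: no drift.
balanced = label_psi(["cat", "dog"] * 50, ["cat", "dog"] * 20)
# Test split dominated by one class: clear drift.
skewed = label_psi(["cat", "dog"] * 50, ["cat"] * 35 + ["dog"] * 5)
print(balanced, skewed)
```

A score of zero means the two splits have identical label shares, and the score grows as the shares diverge. The threshold at which a team treats the drift as a failure is a project-specific choice.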

Source: Author

Best Practices

– For any ML pipeline, teams are encouraged to test the data output from each step of the process. This includes cleaning, preprocessing, augmentation, and, for NLP, tokenization, among other processes. The expectations created can also be used to check incoming data and to see how they hold up against new data.

It is also advised to carry out unit tests on the dataset. In a complex data pipeline, expectations should scale easily to each downstream process to save time and resources. Apart from data validation checks, Deepchecks integrates with Pytest so that ML/data science teams can use deepchecks inside unit tests performed on data.

import pytest
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import TrainTestFeatureDrift


@pytest.fixture(scope='session')
def diabetes_df():
    # Load the diabetes data as one DataFrame that includes the target column.
    diabetes = load_diabetes(return_X_y=False, as_frame=True).frame
    return diabetes


def test_diabetes_drift(diabetes_df):
    train_df, test_df = train_test_split(diabetes_df, test_size=0.33, random_state=42)
    train = Dataset(train_df, label='target', cat_features=['sex'])
    test = Dataset(test_df, label='target', cat_features=['sex'])

    check = TrainTestFeatureDrift(columns=['age', 'sex', 'bmi'])
    # Attach a drift-score condition here (the method name depends on the
    # deepchecks version) so that the assertion below is meaningful.

    result = check.run(train, test)

    assert result.passed_conditions()

– It is also good practice to document all the tests done on the data.


Testing the model is next on the list once the team is sure they have tested the code and data. This usually takes place before training, after training, and during evaluation, inference, and deployment of the model. The cost of not testing, validating, and monitoring a model is far greater than the cost of paying attention to the health of the model.

In 2019, a study found that a healthcare prediction algorithm used by some hospitals in the US was racially biased against black patients. The algorithm assigned black patients who were sicker than white patients the same level of risk, thus depriving a number of black patients of the extra care they needed. Another example that stresses the importance of testing is the 2016 incident in which Microsoft trained a chatbot that became racist due to the racist input it was receiving.

Before going further, it is worth defining the terms commonly used in this space to clear up any misconceptions.

Model evaluation vs testing vs monitoring

Model evaluation: This checks performance using metrics and plots that show how well the model does on the validation or test sets.

Model testing: This is done strictly to check that the model behaves as expected.

Model monitoring: This is a continuous process that ensures the quality of the ML pipeline is maintained on live production data, while also ensuring that the distribution of that data matches or is comparable to the reference window.

These activities are integral to the overall performance and lifespan of an ML system or application. In essence, to build a high-quality model, these are required. If they are not done, errors are bound to occur. Model testing can provide an avenue for a systematic approach to error analysis.

ML Model Testing

There are two classes of model testing based on when the test is done: pre-train tests and post-train tests.


Source: StockVector

Tests before training the model enable teams to detect bugs early. There are various tests that can be run at this stage.

  • Check the shape of the model output and the labels. The labels in the dataset being used are expected to align with the model output.
  • The same checks can be done for the classification output.
  • Check for label leakage.
  • Reuse previous dataset assertions and validations where they apply.
  • Depending on the ML use case, create checks for each desired expectation.

Like most tests, the overarching goal is to identify and reduce errors so that time and resources are not wasted on training a faulty model.
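Two of the checks above can be sketched in a few lines of plain Python. The helper names, the toy rows, and the rule used to flag leakage (a feature that copies the label in every row) are assumptions for illustration. Real leakage is often subtler and may need correlation or feature-importance analysis.

```python
def check_output_shape(predictions, labels, n_classes):
    """Expect one prediction per label and one score per class."""
    if len(predictions) != len(labels):
        raise ValueError("number of predictions and labels differ")
    if any(len(p) != n_classes for p in predictions):
        raise ValueError("each prediction needs one score per class")


def leaking_features(rows, label_key):
    """Return the feature names whose values copy the label in every row."""
    features = [k for k in rows[0] if k != label_key]
    return [f for f in features if all(r[f] == r[label_key] for r in rows)]


rows = [
    {"age": 40, "discharged_flag": 1, "label": 1},
    {"age": 52, "discharged_flag": 0, "label": 0},
    {"age": 40, "discharged_flag": 0, "label": 0},
]
# Stand-in for the output of an untrained two-class model.
untrained_scores = [[0.5, 0.5] for _ in rows]

check_output_shape(untrained_scores, [r["label"] for r in rows], n_classes=2)
print(leaking_features(rows, "label"))
```

Here the made-up discharged_flag column is flagged because it duplicates the label, which is the kind of problem that is far cheaper to find before a long training run than after it.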


The tests in the post-training phase aim to understand the behavior of the model. They investigate the logic learned during training and can give the practitioner valuable behavioral information about the model.

This phase matters because a model might not throw any error when tested on new data, but that doesn’t mean it is ready for production. It can still produce incorrect outputs that cost your business. The three behavioral tests are:

Source: Khuyen Tran

Minimum Functionality Test (MFT):

This can be classified as a unit test. It aims at isolating and testing a specific behavior. It is implemented by generating a sizeable number of examples to detect different failure modes. In scenarios where the prediction errors might cost the business, this method might help to identify these critical scenarios.

Invariance Test (INV):

This test checks whether the model prediction stays the same when perturbations that should not affect the output are applied to the input. These perturbations can be used to create test examples that check for consistency in model predictions. Data augmentation is closely related to this type of test.

Directional Expectation Test:

In this test, specific changes to the input are expected to move the output in a predictable direction.
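The three behavioral tests can be sketched against a toy sentiment scorer. The sentiment_score function below is a hypothetical stand-in for a trained model, used only so that the example runs on its own. In practice the same test functions would call your real model's predict method.

```python
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def sentiment_score(text):
    """Toy stand-in for a trained model: positive minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def test_minimum_functionality():
    # MFT: simple, unambiguous cases the model must get right.
    assert sentiment_score("I love this flight") > 0
    assert sentiment_score("I hate this flight") < 0


def test_invariance():
    # INV: swapping a person's name should not change the prediction.
    assert sentiment_score("Anna had a great trip") == sentiment_score("Omar had a great trip")


def test_directional_expectation():
    # Directional: appending a negative phrase should not raise the score.
    base = "The service was good"
    assert sentiment_score(base + " but the food was awful") <= sentiment_score(base)
```

Each test targets one behavior, so a failure points directly at the capability the model is missing rather than at an aggregate metric.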

Best Practices

  • After running the desired behavioral tests to check the quality of the model, it is advisable to also do a manual quality assessment. This manual assessment helps increase performance and reduce the likelihood of error, because some model properties are difficult to test automatically but become visible when a human inspects them.
  • If errors occur when the model is served, make sure that the training and serving code preprocess the data in the same, expected way.
  • Use testing frameworks based on the use case of the model. Some might be great for NLP but less suitable for recommendation system projects.
  • Start small: break the functionality into parts and run simple tests before moving on to more advanced frameworks.


Testing is not as easy as it seems, especially for big Machine Learning projects that require lots of data and long training cycles. Time is a big challenge, as a single test run can take a long while. If that is the case, choose a simple testing strategy, such as smoke tests and testing with small random samples. Start with simple tests, then gradually move up to more advanced ones. Modularize your code to make it easy to test.

ML applications and systems require lots of attention and effort to ensure quality, from evaluation to testing and monitoring. Deepchecks is a package that can reduce the load on your team so that it can focus on other aspects of the project. You can check out all we do and start using deepchecks and its integrations for most of your ML testing needs.

To explore all the checks and validation in deepchecks, go try it out yourself! Don’t forget to ⭐ their GitHub repo; it’s really a big deal for open-source-led companies like Deepchecks.

