Best Practices for Testing ML Pipelines


When a machine learning (ML) model is ready to be deployed to a production ML pipeline, it is time to develop appropriate launch and production tests, detect failure modes, and evaluate the quality of the model in production. In ML, however, getting started with evaluations takes time and effort. Tests are created based on the data, the ML model in use, and the specific problems to be solved. They are required to validate input data, feature engineering, the quality of new model versions, the serving infrastructure, and the integration of pipeline components.

Here are some of the various techniques and best practices.

Clean Tables

Many machine learning pipelines are best represented as a directed acyclic graph of task functions:

ML pipeline as a directed acyclic graph of task functions (Source)

Generally, all data-related operations such as cleaning individual tables, combining multiple tables, and aggregating to relevant dimensions are performed in the same task, with no intermediate steps preserved. A Clean Tables approach to ML pipeline testing starts by splitting the data handling functions into two steps: Datafeeds and Dataprep. Datafeeds manages all activities associated with cleaning individual data tables and saving them as clean tables. After that, the Dataprep task handles the entire data combination and aggregation process. The technique is depicted in the figure below: Datafeeds functions clean and save individual tables (A, B, and C), and Dataprep functions then combine and aggregate those cleaned tables (A.clean, B.clean, and C.clean) to create the final table (D).

Data handling in an ML pipeline (Source)
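The split between Datafeeds and Dataprep can be sketched roughly as follows. This is a minimal illustration, not the original author's implementation: the table contents, column names, and the functions `datafeed_a`, `datafeed_b`, and `dataprep` are all hypothetical, and tables are represented as plain lists of dictionaries where a real pipeline would more likely use a DataFrame library.

```python
from collections import defaultdict

# Hypothetical raw tables, represented as lists of row dictionaries.
RAW_A = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": None},
    {"id": 1, "amount": "4.5"},
]
RAW_B = [{"id": 1, "region": " north "}, {"id": 2, "region": "SOUTH"}]


def datafeed_a(rows):
    """Datafeeds step: clean table A only (drop missing amounts, cast to float)."""
    return [
        {"id": row["id"], "amount": float(row["amount"])}
        for row in rows
        if row["amount"] is not None
    ]


def datafeed_b(rows):
    """Datafeeds step: clean table B only (normalize region names)."""
    return [
        {"id": row["id"], "region": row["region"].strip().lower()} for row in rows
    ]


def dataprep(a_clean, b_clean):
    """Dataprep step: join the clean tables on id and sum amount per region."""
    region_by_id = {row["id"]: row["region"] for row in b_clean}
    totals = defaultdict(float)
    for row in a_clean:
        totals[region_by_id[row["id"]]] += row["amount"]
    return [
        {"region": region, "total_amount": total}
        for region, total in sorted(totals.items())
    ]


a_clean = datafeed_a(RAW_A)  # would be saved as A.clean in a real pipeline
b_clean = datafeed_b(RAW_B)  # would be saved as B.clean
table_d = dataprep(a_clean, b_clean)  # final table D
```

Because the clean tables exist as separate intermediate results, each cleaning function and the combining step can be tested on its own.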

After the data handling functions are divided, the clean tables are used to generate test case tables. We can use data synthesis libraries like Synthetic Data Vault (SDV) to create a sampled, synthetic dataset from the original dataset while retaining the same statistical properties. To generate the clean and test case tables, a modified data pipeline that only runs the Datafeeds functions (see figure below) can be used. After the clean tables (A.clean, B.clean, and C.clean) are generated, the test case generation functions produce the test case tables (A.test case, B.test case, and C.test case).

Generating test case tables using clean tables (Source)
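As a rough sketch of a test case generation function, the example below draws a small, reproducible sample from a clean table. Note the substitution: plain random sampling with a fixed seed is used here as a simple stand-in, whereas a synthesizer such as SDV would generate new synthetic rows rather than reuse real ones. The function name `make_test_case_table` and the table contents are hypothetical.

```python
import random


def make_test_case_table(clean_rows, n_rows, seed=0):
    """Return a small, reproducible sample of a clean table.

    Plain random sampling is a stand-in here; a synthesizer such as SDV
    would generate new rows instead of reusing real ones.
    """
    rng = random.Random(seed)  # fixed seed keeps the test case table stable
    n_rows = min(n_rows, len(clean_rows))
    return rng.sample(clean_rows, n_rows)


# Hypothetical clean table A.clean with 100 rows.
a_clean = [{"id": i, "amount": i * 1.5} for i in range(100)]
a_test_case = make_test_case_table(a_clean, n_rows=10)
```

Keeping the seed fixed matters: the expected results are only comparable across branches if every run sees the same test case tables.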

The ML pipeline should first be run on the test case tables from start to finish to produce the expected results. After that, for any subsequent feature branch, run a variant of the ML pipeline on the same test case tables and compare its results to the expected results. The Gitflow branching strategy can be applied for this purpose. This step entails running two pipelines on two branches (see figure below). First, the master branch is run on the test case tables to produce the expected results. The feature branch is then run on the same tables, with an assert function in its final step that reads the expected result obtained from the master branch and compares it to the actual result generated by the feature branch run.

Running tests (Source)
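The final assert step could look something like the sketch below. It assumes, purely for illustration, that results are stored as JSON files of row dictionaries; the helper names `save_result` and `assert_matches_expected` and the file name are hypothetical.

```python
import json
import math
import tempfile
from pathlib import Path


def save_result(rows, path):
    """Persist a pipeline result (a list of row dictionaries) as JSON."""
    Path(path).write_text(json.dumps(rows))


def assert_matches_expected(actual_rows, expected_path, rel_tol=1e-9):
    """Final step on a feature branch: compare against the stored expected result."""
    expected_rows = json.loads(Path(expected_path).read_text())
    assert len(actual_rows) == len(expected_rows), "row count differs"
    for actual, expected in zip(actual_rows, expected_rows):
        assert actual.keys() == expected.keys(), "columns differ"
        for key, expected_value in expected.items():
            if isinstance(expected_value, float):
                # Floats are compared with a tolerance to avoid spurious failures.
                assert math.isclose(actual[key], expected_value, rel_tol=rel_tol), key
            else:
                assert actual[key] == expected_value, key


with tempfile.TemporaryDirectory() as tmp:
    expected_path = Path(tmp) / "expected_result.json"
    # Run on the master branch: store the expected result.
    save_result([{"region": "north", "total_amount": 15.0}], expected_path)
    # Run on the feature branch: compare the actual result with it.
    assert_matches_expected([{"region": "north", "total_amount": 15.0}], expected_path)
```

If the feature branch changes the output, the assertion fails and names the offending column, which points directly at the behavior that changed.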

Testing Pyramid

The traditional, simplified software testing pyramid begins with many Unit Tests, each of which tests a single piece of software functionality independently of the others. Integration Tests are then used to ensure that connecting the isolated components within a pipeline works as expected. Finally, User Interface (UI) or Acceptance Tests are conducted to ensure that the application functions as expected from the user’s perspective.

Testing pyramid (Source)

For data products, the ML testing pyramid looks slightly different. A Unit Test has to be written for each piece of independent functionality, ensuring that every part of the data transformation process has the expected effect on the data. Consider different scenarios for each piece of functionality; these tests run as part of the continuous integration (CI) pipeline. With CI, developers push code updates several times a day, and a set of scripts automatically builds and tests the application on each push to the repository, which minimizes the chance of errors reaching the application. Unit Tests also assist in debugging: we can add a test that reproduces a newly discovered bug, use it to fix the bug, and thereby ensure it does not recur.

ML testing pyramid (Source)
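A unit test for a single data transformation might look like the following sketch, which uses Python's standard `unittest` module. The transformation `normalize_amount` and the bug it guards against are hypothetical examples.

```python
import unittest


def normalize_amount(value):
    """Hypothetical transformation: parse a string like ' 1,200.50 ' into a float."""
    if value is None or value.strip() == "":
        return None  # missing values stay missing
    return float(value.replace(",", "").strip())


class NormalizeAmountTest(unittest.TestCase):
    def test_plain_number(self):
        self.assertEqual(normalize_amount("10.5"), 10.5)

    def test_thousands_separator(self):
        self.assertEqual(normalize_amount(" 1,200.50 "), 1200.5)

    def test_missing_value(self):
        self.assertIsNone(normalize_amount(None))

    def test_blank_string_regression(self):
        # Regression test reproducing a (hypothetical) bug where blank input crashed.
        self.assertIsNone(normalize_amount("   "))
```

Running `python -m unittest` in the CI pipeline on every push would execute these tests, and the regression test keeps the fixed bug from silently returning.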

Integration Tests determine whether components created separately function as expected when put together. In a machine learning data pipeline, they must ascertain that the data cleaning process produces a dataset suitable for the model, and that model training can handle the data provided to it and output results (which also ensures that the code can be safely refactored later). These tests will not be as numerous as Unit Tests, but they will still be part of your CI pipeline. They should be used to assess the end-to-end functionality of a component and therefore to evaluate more complex scenarios.
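An integration test along these lines can be sketched as below. The components are deliberately tiny and hypothetical: `clean` drops incomplete rows, and `train` fits a one-parameter least-squares line through the origin as a placeholder for real model training.

```python
def clean(rows):
    """Data cleaning component: drop rows with a missing feature or target."""
    return [r for r in rows if r.get("x") is not None and r.get("y") is not None]


def train(rows):
    """Training component: fit y = slope * x through the origin (least squares)."""
    numerator = sum(r["x"] * r["y"] for r in rows)
    denominator = sum(r["x"] ** 2 for r in rows)
    slope = numerator / denominator
    return lambda x: slope * x


def test_clean_then_train():
    raw = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": None}, {"x": 3.0, "y": 6.0}]
    cleaned = clean(raw)
    # The cleaning output must be suitable for the model.
    assert all(r["y"] is not None for r in cleaned)
    # Training must accept the cleaned data and produce a usable model.
    model = train(cleaned)
    prediction = model(4.0)
    assert isinstance(prediction, float)
    assert abs(prediction - 8.0) < 1e-9


test_clean_then_train()
```

The test exercises the hand-off between the two components rather than either one in isolation, which is what distinguishes it from the unit tests.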

When it comes to product development, the raw outputs of an ML model are rarely the desired end results on their own. Before being consumed by a user or another application, these outputs are usually combined with other business or end-user requirements for the specified problem. As a result, we must confirm that the model solves the user problem, not just that the accuracy, F1 score, or another statistical measure is sufficient. This type of validation should be performed regularly, and the results should be made available to the organization. This helps ensure that the organization sees progress in the data science components and that issues caused by changes or stale data are identified early.

Testing. CI/CD. Monitoring.

Because ML systems are more fragile than you think. All based on our open-source core.


Continuous Integration/Continuous Deployment Framework

As the name implies, this approach requires you to first implement your ML pipeline using a Continuous Integration/Continuous Deployment (CI/CD) framework. This method lets developers produce higher-quality ML models faster by automating the whole process throughout the development, testing, production, and monitoring phases of the ML application development lifecycle. With CI, put all of your source code under version control (e.g., Git, Visual Studio Team Services, or Team Foundation Version Control). Keeping track of the code will be easier and less time-consuming with version control software, especially as your codebase and team grow. Models should likewise be versioned using a model registry.

Once your application has been versioned, add continuous integration by incorporating automated machine learning pipeline testing into your CI/CD pipeline. Each time you check in your code, verify that every test passes. This is crucial for developing dependable software and machine learning pipelines.

After establishing a robust CI/CD pipeline, we recommend concentrating on adding logging. Logging, as discussed in the tracking section, is an essential component of any ML pipeline that will help you achieve clarity and reliability. Logging is divided into two categories: external and internal. External logging records which model is applied to which data, while internal logging lets you see your ML pipeline’s inner workings and quickly debug problems.
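The two categories of logging can be illustrated with Python's standard `logging` module. This is only a sketch: the logger names, the `score` function, and the placeholder model are assumptions made for the example.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Two hypothetical loggers, one per category of logging.
external_log = logging.getLogger("pipeline.external")
internal_log = logging.getLogger("pipeline.internal")


def score(rows, model_version, data_version):
    """Score rows with a placeholder model while logging both kinds of information."""
    # External logging: which model is applied to which data.
    external_log.info(
        "model=%s data=%s rows=%d", model_version, data_version, len(rows)
    )
    predictions = []
    for row in rows:
        if row.get("x") is None:
            # Internal logging: details that help debug the pipeline's inner workings.
            internal_log.warning("skipping row with missing feature: %r", row)
            continue
        predictions.append(2.0 * row["x"])  # placeholder for a real model call
    internal_log.info("scored %d of %d rows", len(predictions), len(rows))
    return predictions
```

Using separate loggers makes it possible to route the external records to long-term storage for auditing while keeping the more verbose internal records for debugging.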

Monitoring your machine learning pipeline is crucial for gaining business value. Keep track of performance metrics like uptime, calculation time, and latency. You should also track how frequently the ML pipeline generates actionable insights. Keep it simple when monitoring your ML pipeline, and concentrate on operational metrics (which help you with the daily operation of your ML pipeline) and performance metrics (which measure the business value of the ML model). A simple ML pipeline with version control, logging, active performance monitoring, and a well-established machine learning CI/CD process is a great place to start.
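As one small example of an operational metric, the sketch below records the latency of each call to a pipeline step. The `LatencyMonitor` class and the `predict` placeholder are hypothetical; a production setup would more likely export such numbers to a dedicated monitoring system.

```python
import functools
import time
from statistics import mean


class LatencyMonitor:
    """Collects per-call latency for a pipeline step (an operational metric)."""

    def __init__(self):
        self.latencies = []

    def timed(self, func):
        """Decorator that records how long each call to func takes."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if the call raises, so failures are timed too.
                self.latencies.append(time.perf_counter() - start)

        return wrapper

    def summary(self):
        if not self.latencies:
            return {"calls": 0, "mean_seconds": None, "max_seconds": None}
        return {
            "calls": len(self.latencies),
            "mean_seconds": mean(self.latencies),
            "max_seconds": max(self.latencies),
        }


monitor = LatencyMonitor()


@monitor.timed
def predict(x):
    return 2.0 * x  # placeholder for a real model call
```

Calling `monitor.summary()` after a batch of predictions gives the call count together with the mean and maximum latency, which is enough to spot a slowdown early.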

ML Model Testing Platforms

Several companies provide platforms for testing ML services, which make designing ML pipelines more automated and more accessible.

  • AutoML Platforms

Deepchecks provides a flexible, extendable, and editable open-source Python library, used by ML engineers and data scientists, that includes extensive test suites for machine learning models and data. Test suites are made up of checks that can be edited or removed, and checks contain conditions with a pass/fail outcome or produce outputs to display in a notebook. Conditions can also be added to or removed from a check. The library supports numerous validation and testing scenarios and can evaluate the performance of the model. Their business model is built around the extra features that are generally crucial for models relevant to internal business applications or deployed in production.

H2O.ai created an open-source machine learning platform called H2O. This platform supports widely used languages such as R, Python, Java, and Scala and has customizable versions of machine learning and deep learning algorithms. It has an excellent model selection framework, including the ability to form stacked ensembles. H2O AutoML automates tasks such as model validation, model tuning, model selection, and model deployment. Its main features, such as scripting in multiple languages and a Web GUI, can be used for parameter passing, allowing AutoML tasks to be customized. It also provides native support for many data preparation tasks, like missing value imputation and transformations. Good integration with big data platforms such as Hadoop and Spark is another advantage. In terms of user experience, H2O has a sizable and engaged user and contributor base.

H2O.ai platform (Source)

  • Auto-Sklearn

This platform provides 15 classification algorithms and 14 feature engineering pipelines. Auto-Sklearn is built on the popular scikit-learn library. It extends the concept of configuring a general ML framework with efficient global optimization by building an ensemble of all models tested during the optimization. It also uses Bayesian Hyperparameter Optimization via Meta-Learning, which draws on prior optimization knowledge from various datasets to identify the best algorithm-hyperparameter combination. It handles scaling, encoding, and missing values natively. The downsides of this platform are that ensembling occasionally fails to converge, and that the meta-feature computation process can be computationally demanding and time-consuming.

Auto-Sklearn platform (Source)

  • GitLab CI/CD

GitLab CI/CD is a free, open-source application written in Go and Ruby and distributed under the MIT license. It is used to detect bugs and errors early in the model development cycle and to ensure that all code deployed to production complies with the code standards you established for the pipeline. This platform uses Auto DevOps to automatically build, test, deploy, and monitor ML applications.

GitLab CI/CD platform (Source)


There are plenty of ways to test an ML pipeline. A single high-quality, data-driven approach can be used to determine whether the code performs as expected. It should ensure that we can refactor and clean up the code without breaking the product’s functionality, keep track of decisions and previous bugs, and see the progress and state of a product’s ML components.

In general, building a successful machine learning pipeline takes diverse abilities, knowledge, and experience, whether they come from individuals, teams, or large organizations. It is important to always keep things straightforward. If your initial ML pipeline falls short of your business goals, DO NOT GIVE UP.
