How to Monitor ML Models in Production


Machine Learning (ML) models are complex. We trust these models with high-stakes decisions and, in the case of autonomous vehicles, even with our lives. Yet all too often, these models are deployed and then forgotten by data science teams. The data science team moves on to the next project, and the company only finds out about a critical error after it has caused significant damage.

In this post, we focus on monitoring ML models when they are fully deployed and running in production. Ideally, correct monitoring should help you detect a problem as soon as possible, and identify the source of it as well.

An ML model monitoring dashboard lets you visualize how your model is performing and detect potential issues early on (Source)

Why is model monitoring hard?

Monitoring ML models is not straightforward, and it is often done poorly for several reasons. One is that an error is hard to define, since ML models typically give probabilistic results. Another is that evaluation metrics often cannot be calculated on real-world data, because true labels are usually not available (at least not in real time).

ML systems are made up of code, data, and the model itself. Models depend on data, code, system configurations, and requirements to ensure everything runs smoothly. Other reasons that make monitoring difficult include:

  • Entanglements. Models are not independent of the data they use. When input data distributions change, or features are added or removed, the inputs no longer match the features the model was trained on. Because a model's behavior depends on its data and can be non-deterministic (unpredictable), it is difficult to track down the exact problem when that behavior changes.
  • Pipeline jungles. This mostly appears during data preparation. As ML systems evolve, engineering teams add more ingestion and feature engineering tasks, resulting in many distinct pipelines. When the model produces an incorrect output, it can be challenging to track down the issue and debug every pipeline.
  • Configurations. Model versions and hyperparameters are frequently governed by system configuration, so minor configuration flaws can cause the ML system to behave differently.

Read more about this in the Google paper “Hidden Technical Debt in Machine Learning Systems.”

Additionally, Machine Learning is still a relatively young technology, and the bridge between data science and DevOps is still being constructed.

What could break your model?

Getting good results on the development set is important, but it’s hardly enough. Running ML models in the real world has a variety of challenges that could cause model degradation.

Dev/Production data mismatch. Many ML models are trained on hand-crafted, clean datasets. When these models are applied to real-world data, they often perform poorly because of the mismatch.

Data integrity issues. Think of the process the data goes through until being fed into your model. It may be coming from multiple sources – the format of the data may change over time, fields could be renamed, categories may be added or split, and more. Any such change can have a huge impact on your model performance.

Data drift and concept drift. Data in the real world is constantly changing. Social trends, market shifts, and global events affect every industry. These may affect the distribution of the data that is being fed to your model or that of the desired target prediction, so the data we’ve trained our model on becomes less and less relevant over time.

We will discuss various methods for detecting each of these potential issues and locating the source of the problem. To get a full picture of the model’s performance, we will need to monitor each of the relevant components over time, starting from the raw data, to engineered features, and finally to model performance.

Monitoring Your Model

ML model monitoring can be done in two ways:

  • Functional monitoring: Monitors data, model and predictions
  • Operational monitoring: Monitors system use and cost

Only the functional level of model monitoring, which emphasizes data and model monitoring, is covered in this article.

Data Monitoring

Monitoring your data and engineered features is essential for detecting when you might have an issue with your model, and identifying the source of the problem. Remember, your model is only as good as the data it has been trained on, so when there is a shift in the data we can’t expect the performance to remain as high as it once was.

“A Machine Learning model is only as good as the data it is fed.” (Reynold Xin, CTO at Databricks)

Detecting Data Integrity Issues

This is a fairly simple step you could take that will save you a lot of heartache. Essentially, we want to validate whether the schema of the data in production is identical to the development data schema and that it does not change with time. This includes checking the consistency of feature names and data types, detecting new possible values for categorical data or new textual values, and identifying missing values.

The data pipeline can be very complex and there can be a multitude of causes for each of these changes. If a change like this in the data goes unnoticed, it’s bad news.

Column rename can really break your model, and you should be the first to find out about it (source)

assert df1.dtypes.equals(df2.dtypes)

This single line of code, using the Python library pandas, will fail with an AssertionError if any column names have been changed, added, or removed, or if any column data types have changed.
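The one-line assertion tells you that something changed, but not what. A slightly longer sketch can list the individual differences, including new categorical values and newly missing values. The function name `find_schema_issues` and the example columns are hypothetical, chosen only for illustration:

```python
import pandas as pd


def _is_categorical_like(series: pd.Series) -> bool:
    """Treat object, string, and categorical columns as categorical (an assumption)."""
    return (
        pd.api.types.is_object_dtype(series)
        or pd.api.types.is_string_dtype(series)
        or isinstance(series.dtype, pd.CategoricalDtype)
    )


def find_schema_issues(reference: pd.DataFrame, production: pd.DataFrame) -> list:
    """Return human-readable schema differences between two DataFrames."""
    issues = []
    ref_cols, prod_cols = set(reference.columns), set(production.columns)

    for col in sorted(ref_cols - prod_cols):
        issues.append(f"missing column: {col}")
    for col in sorted(prod_cols - ref_cols):
        issues.append(f"unexpected column: {col}")

    for col in sorted(ref_cols & prod_cols):
        ref, prod = reference[col], production[col]
        if ref.dtype != prod.dtype:
            issues.append(f"dtype changed for {col}: {ref.dtype} -> {prod.dtype}")
        elif _is_categorical_like(ref):
            # Values seen in production that never appeared in the reference data
            new_values = set(prod.dropna()) - set(ref.dropna())
            if new_values:
                issues.append(f"new values in {col}: {sorted(new_values, key=str)}")
        # Missing values that were absent from the reference data
        if prod.isna().any() and not ref.isna().any():
            issues.append(f"new missing values in {col}")
    return issues


reference = pd.DataFrame({"age": [30, 41], "gender": ["f", "m"]})
production = pd.DataFrame(
    {"age": [25.0, None], "gender": ["f", "x"], "zip": ["10001", "10002"]}
)
issues = find_schema_issues(reference, production)
for issue in issues:
    print(issue)
```

Running a check like this on every incoming batch, and alerting when the list is non-empty, is one simple way to catch a renamed column before it reaches the model.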

Detecting Data and Concept Drift

Data drift occurs when the distribution of the input features, P(X), changes over time. This can happen either because of a change in the data structure (e.g., a new gender option added in tabular data) or because of a change in the real world (e.g., the stock market behaves differently during a crash).

To detect data drift, we compare the distribution of each feature independently (the joint distribution can be compared as well to detect more complex drifts), in the development data and in the production data.
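As a minimal sketch of the per-feature comparison, the two-sample Kolmogorov-Smirnov test from SciPy can be applied to each column in turn. The function name `drifted_features`, the synthetic data, and the 0.05 threshold are assumptions made for this illustration:

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(reference, production, feature_names, alpha=0.05):
    """Return {feature: p_value} for features whose distributions differ."""
    drifted = {}
    for i, name in enumerate(feature_names):
        result = ks_2samp(reference[:, i], production[:, i])
        if result.pvalue < alpha:
            drifted[name] = result.pvalue
    return drifted


rng = np.random.default_rng(seed=0)
reference = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
production = np.column_stack(
    [
        rng.normal(loc=0.0, scale=1.0, size=500),  # same distribution as before
        rng.normal(loc=1.5, scale=1.0, size=500),  # mean has shifted
    ]
)
report = drifted_features(reference, production, ["stable", "shifted"])
print(report)
```

With many features, some will fall below the threshold by chance, so in practice you may want a stricter threshold or a multiple-testing correction. The KS test also applies only to numerical features; categorical features need a different test, such as chi-squared.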


Using the open-source tool Evidently, we analyze a toy example on the Iris flower dataset, which ships with scikit-learn. The dataset consists of four features describing the flower structure, and the objective is to classify each flower as one of three iris species.

All coding examples are written using Python:

# Install the Evidently library (notebook syntax; use `pip install evidently` in a shell).
# Note: this example uses Evidently's early Dashboard API; newer releases expose a different interface.
!pip install evidently

import os

import pandas as pd
from sklearn import datasets
from evidently.dashboard import Dashboard
from evidently.tabs import DriftTab, CatTargetDriftTab

iris = datasets.load_iris()
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)
os.makedirs("reports", exist_ok=True)  # make sure the output folder exists

# To generate the Data Drift report, run:
iris_data_drift_report = Dashboard(iris_frame[:100], iris_frame[100:], tabs=[DriftTab])
iris_data_drift_report.save("reports/my_report.html")

# To generate the Data Drift and the Categorical Target Drift reports, run:
iris_frame["target"] = iris.target
iris_data_drift_report = Dashboard(iris_frame[:100], iris_frame[100:], tabs=[DriftTab, CatTargetDriftTab])
iris_data_drift_report.save("reports/my_report_with_2_tabs.html")

Let us imagine that the first 100 examples are from the development data, while the last 50 are from the production data. The dataset is sorted by the target value, so we expect to see strong data drift and concept drift.

From the data drift report, we get the following:

As we can see, three of the four features have a significant drift in the distribution. This is an indicator that the model should be retrained, or at least evaluated on real-world data. Note that we did not need to have access to the real labels to detect data drift, so this could be implemented for any kind of setup regardless of the availability of real-world labels.

In the report regarding concept drift or target drift, we see that there is no overlap between the reference data and production data. If our model never learned to predict labels of a specific class during training, it will definitely fail to do so in production!

Concept drift, by contrast, is a change in P(Y|X) over time. It is not enough to detect changes in the distribution of true labels over time, since this can remain constant while there’s been a significant change in the relation between the features and the targets. In the report, we see the correlation between different features and the target in both development data and production data.

Changes in features are not always crucial to the model’s prediction. If we detect significant concept drift, however, it is very likely that our model is making bad calls.

Concept drift happens when P(Y|X) changes over time. This can be caused by changes in data structure or a shift in real-world data, and it degrades prediction quality until the model is updated. For example, the advertisement click rate for a specific product may change dramatically when competition enters the market.

Types of Concept Drift

Gradual. This occurs naturally because the business world is dynamic. Users’ tastes may change, or new kinds of input data may be introduced that slowly alter the underlying pattern.

Sudden. This occurs when outlier events happen. It results in a sharp drop in model performance. A global crisis like an epidemic can cause this phenomenon to occur.

Blips. This type can be brought on by singular occurrences. It may occur when there is a problem with the system’s performance or when a customer’s actions are inconsistent with how the product is typically used.

Recurring. The primary reason for this is seasonality. This might be the result of a shift in customer behavior during a particular season in a certain location.

Performance fluctuations can follow different patterns. For example, the Covid-19 outbreak had a “sudden” effect on many ML models’ performance (source)

Detecting concept drift is done by comparing the joint distribution of individual features and the target. To measure the “distance” between distributions, statistical tests such as the Kolmogorov-Smirnov (KS) test for numerical data and the chi-squared test for categorical data are commonly used, with the resulting p-value indicating whether the difference is significant (read more here).
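When labels are available for a sample of production data, a rough first check is to compare each feature's correlation with the target in the two datasets, much like the correlation view in the report above. This is only a linear proxy for a change in P(Y|X), and the function name `target_correlation_shift` and the toy data are hypothetical:

```python
import pandas as pd


def target_correlation_shift(reference, production, features, target="target"):
    """Absolute change in each feature's Pearson correlation with the target."""
    shift = {}
    for col in features:
        ref_corr = reference[col].corr(reference[target])
        prod_corr = production[col].corr(production[target])
        shift[col] = abs(prod_corr - ref_corr)
    return pd.Series(shift).sort_values(ascending=False)


# x1 flips its relation to the target; x2 keeps the same relation.
reference = pd.DataFrame(
    {"x1": [1, 2, 3, 4, 5, 6], "x2": [0, 1, 0, 1, 0, 1], "target": [0, 0, 0, 1, 1, 1]}
)
production = pd.DataFrame(
    {"x1": [1, 2, 3, 4, 5, 6], "x2": [1, 0, 1, 0, 1, 0], "target": [1, 1, 1, 0, 0, 0]}
)
shift = target_correlation_shift(reference, production, ["x1", "x2"])
print(shift)
```

A large shift for a feature the model relies on is a hint that the learned relationship no longer holds. Nonlinear changes will not show up in a correlation, so this complements, rather than replaces, the statistical tests mentioned above.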

Identifying the Pattern

Another important thing to note is that not every decrease in performance is an indication that your model is broken. Try to understand if your performance fluctuations follow a specific pattern (seasonality/response to financial crisis). If you can identify such a pattern, you may be able to create a more robust model that will have better overall performance.



Model and Prediction Monitoring

The most straightforward way to monitor your ML model is to constantly evaluate your performance on real-world data. You could customize triggers to notify you of significant changes in metrics such as accuracy, precision, or F1. Model monitoring tools are best used to automate this process and reduce stress on the data science team. If you find that there is a significant decrease in these metrics, it may indicate that there is a critical issue with your data or with your model.

ML models tend to become stale over time; proper monitoring will tell you when performance decreases and it’s time to retrain your model (source)
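A trigger of the kind described above can be sketched in a few lines of plain Python. The names `f1_score_binary` and `check_metric`, the baseline value, and the 10% threshold are assumptions made for this illustration; a monitoring tool would normally provide this for you:

```python
def f1_score_binary(y_true, y_pred):
    """F1 score for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def check_metric(name, baseline, current, max_relative_drop=0.10):
    """Return an alert message if the metric fell by more than the allowed fraction."""
    drop = (baseline - current) / baseline
    if drop > max_relative_drop:
        return f"ALERT: {name} dropped {drop:.0%} (from {baseline:.2f} to {current:.2f})"
    return None


# Labels collected from production, and the predictions the model made for them
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]

current_f1 = f1_score_binary(y_true, y_pred)
alert = check_metric("f1", baseline=0.80, current=current_f1)
if alert:
    print(alert)
```

Using a relative drop rather than a fixed floor means the same rule can be reused across models with different baseline scores.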

Granular Monitoring

To gain more granular insights into your model’s performance, it is essential to continuously evaluate it on specific data slices and examine per-class performance as well. If your model is customer-facing, the ultimate business goal is to keep customers satisfied. This can be done by maintaining optimal model performance through monitoring and fixing any issues it might have. Additionally, you could automatically identify slices with low performance to gain insight and improve your models. We recommend checking out slicefinder for detecting problematic slices automatically.

Detecting slices with poor performance can help you make more robust models easily (source)

Detecting weak slices automatically with the Deepchecks system (source)
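For a manual version of slice-level evaluation, a pandas group-by over one column is often enough to surface a weak segment. The function name `accuracy_by_slice` and the column names below are hypothetical:

```python
import pandas as pd


def accuracy_by_slice(df, slice_col, label_col="label", pred_col="prediction"):
    """Accuracy and sample count per slice, worst slices first."""
    scored = df.assign(correct=df[label_col] == df[pred_col])
    summary = scored.groupby(slice_col)["correct"].agg(["mean", "size"])
    summary = summary.rename(columns={"mean": "accuracy", "size": "n_samples"})
    return summary.sort_values("accuracy")


predictions = pd.DataFrame(
    {
        "country": ["US", "US", "US", "US", "DE", "DE"],
        "label": [1, 0, 1, 0, 1, 0],
        "prediction": [1, 0, 1, 0, 0, 0],
    }
)
by_country = accuracy_by_slice(predictions, "country")
print(by_country)
```

Overall accuracy here is 5 out of 6, which hides the fact that the "DE" slice is right only half the time. Keeping the sample count next to the accuracy helps you avoid overreacting to slices with very few examples.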

Monitor What Your Model Doesn’t Do

“Fix it – even if it ain’t broken”

When there is a significant decrease in your model’s performance, it’s probably time to update it. But what if the model is still “okay?” To ensure we are getting the best performance, we recommend continuously comparing our production model’s performance with potential new candidates.

Training your full model from scratch may be costly, so we suggest training some simple models such as Random Forest or XGBoost on new data as it flows in, and using drifts in performance or in feature importance as indicators for some significant changes in the data that call for retraining.

Shifts in feature importance can indicate that your model is underperforming (Source: Deepchecks)
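As a sketch of this idea, the snippet below trains a small scikit-learn Random Forest on an old and a new batch of data and compares the feature importances. The synthetic data, in which the label switches from depending on the first feature to depending on the second, is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def feature_importances(X, y, seed=0):
    """Fit a small Random Forest and return its feature importances."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    return model.feature_importances_


rng = np.random.default_rng(seed=0)
X_old = rng.normal(size=(400, 2))
y_old = (X_old[:, 0] > 0).astype(int)  # label driven by feature 0
X_new = rng.normal(size=(400, 2))
y_new = (X_new[:, 1] > 0).astype(int)  # label now driven by feature 1

old_importance = feature_importances(X_old, y_old)
new_importance = feature_importances(X_new, y_new)
importance_shift = np.abs(new_importance - old_importance)
print(importance_shift)
```

Because the lightweight model is cheap to refit, this comparison can run on every new batch, with the expensive full retraining reserved for when the shift crosses a threshold you choose.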

Monitoring the Unknown

By monitoring evaluation metrics, we are able to create a pretty good picture of the model’s status, especially when evaluating many different slices of the data independently. However, it is important to note that for many applications, the true labels are not available for production data in real-time. For cases like these, you will need to evaluate performance through proxy metrics such as comparing your model with a baseline model or an expert prediction.
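One simple proxy metric is the rate at which the production model agrees with a baseline model on the same inputs. A sudden drop in agreement does not prove the model is wrong, but it is a signal worth investigating. The function name `agreement_rate` and the example predictions are hypothetical:

```python
def agreement_rate(model_preds, baseline_preds):
    """Fraction of examples on which two sets of predictions agree."""
    if len(model_preds) != len(baseline_preds):
        raise ValueError("prediction lists must have the same length")
    if not model_preds:
        raise ValueError("no predictions to compare")
    matches = sum(m == b for m, b in zip(model_preds, baseline_preds))
    return matches / len(model_preds)


model_preds = ["spam", "ham", "ham", "spam", "ham"]
baseline_preds = ["spam", "ham", "spam", "spam", "ham"]
rate = agreement_rate(model_preds, baseline_preds)
print(f"agreement with baseline: {rate:.0%}")
```

Tracking this rate over time, rather than looking at a single value, is what makes it useful: the absolute level matters less than a change from its usual range.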

Updating Your Model

You’ve monitored your model, you have identified significant concept drift, and when retraining your model you see significant improvements. Now it’s time to deploy the updated model. This whole process is a completely normal part of the lifecycle of ML models in production and you should consider best practices for making this process as smooth as possible.

A/B Testing for ML Models

Just as for any UX feature, we can use A/B testing for our ML models. This way, we are able to evaluate whether the newer model actually achieves better performance in scenarios where we don’t have full information. For example, a recommendation system such as the one used by Netflix knows if the user decides to watch a suggested movie, but cannot know whether that user would have watched a different movie had it been suggested instead. In that case, we could run two models simultaneously and compare their success rate.

Additionally, using A/B testing can help you avoid issues when deploying a new model since it provides a more gradual transition. We can start off by directing some small percentage of the load to the new model and evaluate performance, then slowly increase this percentage for a smooth transition.

Classic UX A/B testing can be used to evaluate the performance of a new candidate ML model “in the wild” (source)
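The gradual rollout can be sketched with deterministic, hash-based routing, so that each user consistently sees the same model while the candidate's share is increased. The function name `assign_model` and the user ID format are assumptions made for this illustration:

```python
import hashlib


def assign_model(user_id: str, candidate_share: float) -> str:
    """Deterministically route a user to the 'candidate' or 'production' model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # stable value in [0, 1)
    return "candidate" if bucket < candidate_share else "production"


users = [f"user-{i}" for i in range(10_000)]

# Start by sending roughly 10% of users to the candidate model.
at_10_percent = [assign_model(u, 0.10) for u in users]
share = at_10_percent.count("candidate") / len(users)
print(f"share routed to candidate: {share:.1%}")
```

Because a user's bucket never changes, raising `candidate_share` only moves additional users to the candidate; nobody is switched back, which keeps the comparison between the two groups clean.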

Automated Retraining

Once you’ve deployed a steady ML model, and you’ve gone through the process of retraining your model and deploying it again, it’s time to automate the process. In some scenarios where data changes are very swift, it could be worth trying a riskier online learning approach. Here, your model is updated whenever new examples are available.

Left: the more common practice of periodically retraining on new data. Right: the online learning paradigm, in which the same model is updated over time (source)
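As a minimal sketch of online learning, scikit-learn's `SGDClassifier` supports incremental updates through `partial_fit`. The synthetic data stream and the helper `next_batch` are assumptions made for this illustration, standing in for batches of newly labeled production data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(seed=0)


def next_batch(n=200):
    """Simulate a batch of newly labeled examples arriving from production."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full set of classes up front

# Update the same model each time a new batch becomes available.
for _ in range(20):
    X_batch, y_batch = next_batch()
    model.partial_fit(X_batch, y_batch, classes=classes)

X_test, y_test = next_batch(1000)
accuracy = model.score(X_test, y_test)
print(f"accuracy after incremental updates: {accuracy:.2f}")
```

The risk mentioned above comes from the fact that every batch changes the live model, so a batch of corrupted or mislabeled data degrades it immediately. Validating each batch before the update, and keeping a recent checkpoint to roll back to, are reasonable safeguards.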

Keeping track of ML model life cycle made easy with Deepchecks system. source : Deepchecks


ML monitoring is an emerging field that is yet to be fully developed. We have seen various methods for monitoring your model and your data in production, detecting potential issues and identifying their root cause early on. The insight you develop using these methods will help you understand whether your data pipeline is broken, whether it’s time to train a new model, or whether you may continue working on your next project without worry.

