ML models are complex entities. We trust these models with some very important decision-making, and we may even start trusting them with our lives in applications such as autonomous vehicles. Yet all too often, these models are deployed and then forgotten. The data science team moves on to the next project, and the company only finds out about a critical error after it has caused significant damage.
In this post, we will focus on monitoring machine learning models when they are fully deployed and running in production. Ideally, correct monitoring should help you detect when there’s a problem with your model as soon as possible, and identify the source of the problem as well.
ML model dashboard enables you to visualize how your model is performing and detect potential issues early on (Source)
Why is it hard?
Monitoring ML models is not a straightforward task, and it is often done incorrectly, for several reasons. First, it is not simple to define an error, since ML models by nature give probabilistic results. Second, it may not be possible to calculate evaluation metrics on real-world data, since true labels are usually not available (at least not in real time). Finally, machine learning is still a relatively young technology, and the bridge between data science and DevOps is still being built.
What Could Break Your Model?
Getting good results on the development set is important, but it’s hardly enough. Running ML models in the real world poses a variety of challenges, which may cause model degradation.
Dev/Production data mismatch: Many ML models are trained on hand-crafted clean datasets. When these models are then applied to real-world data they have poor performance due to this mismatch.
Data integrity issues: Think of the process the data goes through until being fed into your model. It may be coming from multiple sources, the format of the data may change over time, fields could be renamed, categories may be added or split, and more. Any such change can have a huge impact on your model performance.
Data drift and concept drift: Data in the real world is constantly changing. Social trends, market shifts, and global events affect each and every industry. These in turn may affect the distribution of the data that is being fed to your model, or that of the desired target prediction. And thus the data we’ve trained our model on becomes less and less relevant over time.
We will discuss various methods for detecting each of these potential issues, and locating the source of the problem automatically. In order to get a full picture of the model’s performance we will need to monitor each one of the relevant components over time, starting from the raw data, to engineered features and finally to model performance.
Monitoring Your Model
The most straightforward way to monitor your ML model is to constantly evaluate your performance on real-world data. You could customize triggers to notify you when there are significant changes in metrics such as accuracy, precision, or F1. If you find that there is a significant decrease in these metrics, it may indicate that there is a critical issue with your data or with your model.
ML models tend to become stale over time; proper monitoring will tell you when performance decreases and it’s time to retrain your model (source)
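The metric triggers described above could be sketched roughly as follows. This is a minimal illustration, not a production alerting system: the function names `binary_metrics` and `check_for_degradation`, the binary 0/1 label convention, and the `max_drop` threshold are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def check_for_degradation(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.05,  # hypothetical tolerance, tune per application
) -> List[str]:
    """Return one alert message per metric that fell more than max_drop below baseline."""
    alerts = []
    for name, base_value in baseline.items():
        drop = base_value - current.get(name, 0.0)
        if drop > max_drop:
            alerts.append(f"{name} dropped by {drop:.3f} (baseline {base_value:.3f})")
    return alerts
```

In practice the baseline would come from the development set, and the current metrics would be recomputed on each new batch of labeled production data.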
To gain more granular insights into your model’s performance, it is essential to continuously evaluate your model on specific data slices, and examine per-class performance as well. If your model is customer-facing, you will want to ensure that your most loyal customers are having a good experience. Additionally, you could automatically identify slices with low performance in order to gain insight and improve your models. We recommend checking out slicefinder for detecting problematic slices automatically.
Detecting slices with poor performance can help you make more robust models easily (source)
Detecting weak slices automatically with the Deepchecks system (source)
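As a rough illustration of per-slice evaluation, the sketch below computes accuracy for slices that are defined up front and flags the ones that lag behind the overall score. It does not search for slices automatically the way slicefinder does; the record layout `(slice_key, y_true, y_pred)`, the function names, and the `margin` parameter are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def accuracy_by_slice(
    records: Iterable[Tuple[Hashable, int, int]]
) -> Dict[Hashable, float]:
    """Accuracy per slice, where each record is (slice_key, y_true, y_pred)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for key, y_true, y_pred in records:
        total[key] += 1
        correct[key] += int(y_true == y_pred)
    return {key: correct[key] / total[key] for key in total}


def weak_slices(
    slice_accuracy: Dict[Hashable, float],
    overall_accuracy: float,
    margin: float = 0.1,  # hypothetical tolerance below the overall accuracy
) -> List[Hashable]:
    """Slices whose accuracy is more than `margin` below overall, worst first."""
    flagged = [k for k, acc in slice_accuracy.items() if acc < overall_accuracy - margin]
    return sorted(flagged, key=lambda k: slice_accuracy[k])
```

A slice key could be a customer segment such as "loyal" or "new", a region, or a predicted class, depending on which groups matter most for the application.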
Identifying the Pattern
Another important thing to note is that not every decrease in performance is an indication that your model is broken. Try to understand if your performance fluctuations follow a specific pattern (e.g. seasonality/response to financial crisis), and if you are able to identify such a pattern, you may be able to create a more robust model that will have better overall performance.
Performance fluctuations can follow different patterns, for example, the Covid-19 outbreak had a “sudden” effect on many ML models’ performance (source)
Monitor What Your Model Doesn’t Do
“Fix it – even if it ain’t broken”
When there is a significant decrease in your model’s performance, it’s probably time to update it. But what if the model is still “okay”? In order to ensure we are getting the best performance we can, we recommend continuously comparing our production model’s performance with potential new candidates.
Training your full model from scratch may be a costly operation, and so we suggest training some simple models such as Random Forest or XGBoost on new data as it flows in, and using drifts in performance or in feature importance as indicators for some significant changes in the data that call for retraining.
Shifts in feature importance can indicate that your model is underperforming. Source: Deepchecks
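One possible way to quantify a shift in feature importance is sketched below. It assumes the importances have already been extracted from the lightweight models (for instance from the `feature_importances_` attribute of a scikit-learn random forest) and are passed in as plain dictionaries. The function names and the summary score are illustrative choices, not a standard method.

```python
from typing import Dict


def normalize(importances: Dict[str, float]) -> Dict[str, float]:
    """Scale importances so they sum to 1, making two models comparable."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must have a positive sum")
    return {name: value / total for name, value in importances.items()}


def importance_shift(
    reference: Dict[str, float], current: Dict[str, float]
) -> Dict[str, float]:
    """Per-feature change in normalized importance (current minus reference)."""
    ref = normalize(reference)
    cur = normalize(current)
    features = sorted(set(ref) | set(cur))
    # A feature missing from one side is treated as having zero importance there.
    return {name: cur.get(name, 0.0) - ref.get(name, 0.0) for name in features}


def total_shift(reference: Dict[str, float], current: Dict[str, float]) -> float:
    """Single score in [0, 1]: half the sum of absolute per-feature changes."""
    return 0.5 * sum(abs(d) for d in importance_shift(reference, current).values())
```

A score near 0 suggests the new data is explained by the same features as before, while a large score is a hint that the data has changed enough to justify a closer look or a full retrain.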
Monitoring the Unknown?
By monitoring evaluation metrics, we can build a fairly good picture of the model’s status, especially when evaluating many different slices of the data independently. However, for many applications the true labels are not available for production data in real time. In such cases, you will need to evaluate performance through proxy metrics, such as comparing your model with a baseline model or an expert prediction.
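A simple proxy metric of this kind is the agreement rate between the production model and a baseline. The sketch below is one hedged way to track it over consecutive windows of predictions; the function names and the fixed window size are assumptions for the example.

```python
from typing import List, Sequence


def agreement_rate(model_preds: Sequence, baseline_preds: Sequence) -> float:
    """Fraction of examples on which the model and the baseline agree."""
    if len(model_preds) == 0 or len(model_preds) != len(baseline_preds):
        raise ValueError("prediction lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(model_preds, baseline_preds) if a == b)
    return matches / len(model_preds)


def agreement_by_window(
    model_preds: Sequence, baseline_preds: Sequence, window: int
) -> List[float]:
    """Agreement rate per consecutive window, so changes over time become visible."""
    if window <= 0:
        raise ValueError("window must be positive")
    rates = []
    for start in range(0, len(model_preds), window):
        rates.append(
            agreement_rate(
                model_preds[start:start + window],
                baseline_preds[start:start + window],
            )
        )
    return rates
```

A sudden drop in agreement does not say which of the two is wrong, but it is a signal that something has changed and that a sample of recent predictions deserves manual review.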
Monitoring Your Data
Monitoring your data and engineered features is essential for detecting when you might have an issue with your model, and identifying the source of the problem. Remember, your model is only as good as the data it has been trained on, and so when there is a shift in the data we can’t expect the performance to remain as high as it once was.
“A machine learning model is only as good as the data it is fed”
- Reynold Xin, CTO at Databricks
Detecting Data Integrity Issues
This is a fairly simple step you could take that will save you a lot of heartache. Essentially we want to validate that the schema of the data in production is identical to the development data schema and that it does not change with time. This includes checking the consistency of feature names and data types, detecting new possible values for categorical data or new textual values, identifying missing values, and more.
The data pipeline can be very complex, and there can be a multitude of causes for each and every one of these changes. If a change like this in the data goes unnoticed, it’s bad news.
Column rename can really break your model, and you should be the first to find out about it (source)
This single line of code using pandas will notify you if any column names have been changed/added/removed, or if any column data types have changed.
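A check along these lines could look like the sketch below, assuming the development and production data are available as two pandas DataFrames. The names `dev_df`, `prod_df`, and `schema_diff` are hypothetical. The quickest form is a single comparison, `dev_df.dtypes.equals(prod_df.dtypes)`, which is true only when column names, column order, and data types all match; the helper below additionally reports what changed.

```python
import pandas as pd


def schema_diff(dev_df: pd.DataFrame, prod_df: pd.DataFrame) -> dict:
    """Report columns that were removed, added, or changed data type."""
    dev_types = {col: str(dtype) for col, dtype in dev_df.dtypes.items()}
    prod_types = {col: str(dtype) for col, dtype in prod_df.dtypes.items()}
    return {
        # Present in development data but absent in production (removed or renamed).
        "missing_columns": sorted(set(dev_types) - set(prod_types)),
        # Present in production but never seen in development (added or renamed).
        "unexpected_columns": sorted(set(prod_types) - set(dev_types)),
        # Same name on both sides, but a different data type.
        "dtype_changes": {
            col: (dev_types[col], prod_types[col])
            for col in dev_types
            if col in prod_types and dev_types[col] != prod_types[col]
        },
    }
```

A renamed column shows up as one missing and one unexpected column, which is usually enough of a hint to trace the change back to its source in the pipeline.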
Detecting Data Drift and Concept Drift
Data drift: When P(X) changes over time. This can happen either because of some change in the data structure (e.g. a new gender option is added in tabular data) or because of a change in the real world (e.g. the stock market behaves differently during a crash).
Concept drift: When P(Y|X) changes over time. This too can be caused by changes in data structure or by changes in the real world, but unlike data drift it directly affects prediction quality. For example, the advertisement click rate for a specific product may change dramatically when competition enters the market.
In order to detect data drift, we compare the distribution of each feature independently in the development data and in the production data (the joint distribution can be compared as well to detect more complex drifts). Similarly, concept drift is detected by comparing the joint distribution of individual features and the target. To measure the “distance” between distributions, statistical tests such as the Kolmogorov-Smirnov (KS) test and the chi-squared test are commonly used (Read more here).
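To make the per-feature comparison concrete, here is a minimal pure-Python sketch of the two-sample KS test for a single numerical feature. It uses the standard large-sample approximation for the critical value rather than an exact p-value, so it should be read as an illustration; the function names and the default `alpha` are assumptions for the example.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], production: Sequence[float]) -> float:
    """Largest gap between the two empirical cumulative distribution functions."""
    ref = sorted(reference)
    prod = sorted(production)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_prod))
    return largest_gap


def ks_critical_value(n: int, m: int, alpha: float = 0.05) -> float:
    """Approximate rejection threshold for sample sizes n and m (large-sample formula)."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))


def has_drifted(
    reference: Sequence[float], production: Sequence[float], alpha: float = 0.05
) -> bool:
    """True when the KS statistic exceeds the approximate critical value."""
    statistic = ks_statistic(reference, production)
    return statistic > ks_critical_value(len(reference), len(production), alpha)
```

Running `has_drifted` once per feature gives a simple drift report. With many features, some false alarms are expected at any fixed `alpha`, so a correction for multiple comparisons may be worth considering.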
Using the open-source tool evidently, we will analyze a toy example on the Iris flower dataset, which is included in the scikit-learn datasets by default. The dataset consists of four features describing the flower structure, and the objective is to identify the type of iris out of three possible types.
```python
import pandas as pd
from sklearn import datasets
from evidently.dashboard import Dashboard
from evidently.tabs import DriftTab, CatTargetDriftTab

iris = datasets.load_iris()
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)

# To generate the Data Drift report, run:
iris_data_drift_report = Dashboard(iris_frame[:100], iris_frame[100:], tabs=[DriftTab])
iris_data_drift_report.save("reports/my_report.html")

# To generate the Data Drift and the Categorical Target Drift reports, run:
iris_frame["target"] = iris.target
iris_data_drift_report = Dashboard(iris_frame[:100], iris_frame[100:], tabs=[DriftTab, CatTargetDriftTab])
iris_data_drift_report.save("reports/my_report_with_2_tabs.html")
```
In this demo, we imagine that the first 100 examples are from the dev data, while the last 50 examples are from the production data. The dataset is sorted by the target value and so we expect to see strong data drift and concept drift.
From the data drift report we get the following:
Three of the four features show a significant drift in distribution, which indicates that the model should be retrained or at least evaluated on real-world data. Note that we did not need access to the real labels to detect data drift, so this check can be implemented for any kind of setup regardless of the availability of real-world labels.
In the report regarding concept drift, or target drift, we see that there is no overlap whatsoever between the reference data and production data. If our model never learned to predict labels of a specific class during training, it will definitely fail to do so in production!
Recall that concept drift is a change in P(Y|X) over time. It is not enough to detect changes in the distribution of true labels over time since this can remain constant while there’s been a significant change in the relation between the features and the targets. In the report, we can see the correlation between different features and the target in both dev data and production data.
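The feature-to-target correlation comparison can be approximated with a few lines of plain Python. The sketch below computes the Pearson correlation of each feature with the target in both datasets and reports the difference. It assumes numerical features and targets and requires production labels; the names `pearson` and `correlation_shift` are illustrative. A correlation shift is only a hint of a change in P(Y|X), not a proof.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_shift(
    dev_features: Dict[str, Sequence[float]],
    dev_target: Sequence[float],
    prod_features: Dict[str, Sequence[float]],
    prod_target: Sequence[float],
) -> Dict[str, float]:
    """Change in feature-to-target correlation (production minus development)."""
    return {
        name: pearson(prod_features[name], prod_target) - pearson(values, dev_target)
        for name, values in dev_features.items()
    }
```

Features whose correlation with the target changes sign or magnitude sharply are natural starting points when investigating why predictions have degraded.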
While in more general data drift, there may be changes in features that are not crucial to the model’s prediction, if we detect any significant concept drift, it is very likely that our model is making bad calls.
Updating Your Model
You’ve monitored your model, identified significant concept drift, and seen significant improvements after retraining. Now it’s time to deploy the updated model. This whole process is a completely normal part of the lifecycle of ML models in production, and you should consider best practices for making it as smooth as possible.
A/B Testing for ML Models
Just as for any UX feature, we can use A/B testing for our ML models. This way, we are able to evaluate whether the newer model actually achieves better performance in scenarios where we don’t have full information. For example, a recommendation system such as the one used by Netflix knows whether the user decides to watch a suggested movie, but cannot know whether that user would have watched some other movie had it been suggested instead. In such a case we could run two models simultaneously and compare their success rates.
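When comparing the success rates of two models, it helps to check that the observed difference is larger than what chance alone would produce. One common choice, used here as an assumption rather than something prescribed by this post, is a two-proportion z-test. The function name and argument names are illustrative.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, trials_a: int, successes_b: int, trials_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled rate under the null hypothesis that both models perform equally.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_error == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Here model A would be the current production model and model B the candidate, with a "success" being whatever the application counts as a good outcome, such as a click or a watched recommendation.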
Additionally, using A/B testing can help you avoid issues when deploying a new model since it provides a more gradual transition. Thus, we can start off by directing some small percentage of the load to the new model and evaluate performance, then slowly increase this percentage for a smooth transition.
Classic UX A/B testing can be used to evaluate the performance of a new candidate ML model “in the wild” (source)
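The gradual transition described above could be implemented by assigning each user to a model deterministically, based on a hash of their identifier. The sketch below is one way to do this; the function name, the string labels, and the use of SHA-256 are assumptions for the example.

```python
import hashlib


def assign_model(user_id: str, candidate_share: float) -> str:
    """Route a user to 'candidate' or 'production' based on a stable hash bucket.

    candidate_share is the fraction of traffic (0.0 to 1.0) sent to the new model.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "candidate" if bucket < candidate_share else "production"
```

Because the bucket depends only on the user identifier, each user consistently sees the same model, and raising `candidate_share` from, say, 0.05 to 0.5 only moves additional users to the candidate without reshuffling those already on it.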
Once you’ve deployed a stable ML model and gone through the process of retraining and redeploying it, it’s time to automate the process. In some scenarios, where the data changes very quickly, it could be worth trying a riskier online learning approach. In this approach, your model is updated whenever new examples become available.
On the left-hand side is an illustration of the more common practice of retraining on new data; on the right-hand side is an illustration of the online learning paradigm, in which the same model is updated over time (source)
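To illustrate the online learning paradigm, the sketch below updates a small logistic regression model one example at a time with stochastic gradient descent. It is a toy written for this explanation, with an assumed class name and a fixed learning rate, not a recommendation for a specific production setup.

```python
import math
from typing import List, Sequence


class OnlineLogisticRegression:
    """Binary classifier whose weights are updated after every labeled example."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: Sequence[float]) -> float:
        """Probability that the example belongs to the positive class."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x: Sequence[float], y: int) -> None:
        """One gradient step on the log loss for a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
```

In scikit-learn, estimators such as `SGDClassifier` expose a `partial_fit` method that serves the same purpose. The risk mentioned above still applies: a burst of bad or mislabeled data immediately changes the deployed model, so monitoring becomes even more important.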
Keeping track of ML model life cycle made easy with Deepchecks system. source: Deepchecks
Monitoring ML systems is an emerging field that is not fully developed. We have seen various methods for monitoring your model and your data in production, detecting potential issues, and identifying their root cause early on. The insights you develop using these methods will help you understand whether your data pipeline is broken, whether it’s time to train a new model, or whether you may continue working on your next project without worry.