Do you have a Machine Learning model in production and want to know if it is still generating good results? Do you wonder what areas you should cover in your Machine Learning monitoring framework? Are you looking for an ML model monitoring checklist that you can rely on? This article lists the critical areas you need to monitor in your Machine Learning workflow, so you do not overlook areas that can break your model.
Read on to learn why and what you should monitor!
Why Should You Monitor Your Machine Learning Workflow?
It is hard to make the results of your Machine Learning model reproducible when the external world, the data you receive about it, and the predictability of your target variables may be in constant flux. A small adjustment at a single point can lead to a huge change in the output because of your model’s interconnectedness.
Change itself might not even be the issue; the uncertainty that the possibility of change produces often is. A lack of trust in your model’s results prevents you and your stakeholders from making decisions and building products. To have confidence in your model, you need to check frequently whether it is still on track.
To decrease uncertainty, you need to monitor your Machine Learning model.
Monitoring Machine Learning models in production is not easy as you need to check on multiple elements within your Machine Learning workflow.
We created this AI model monitoring checklist to verify whether you cover all the necessary areas that can go wrong. Using it will make you more confident in your monitoring efforts and your model.
The Machine Learning Model Monitoring Checklist
We made this list as comprehensive as we could. However, your project may require further issues to consider depending on your problem and the development of new technologies and methods. Be sure to check our blog for additional points!
We organized Machine Learning monitoring areas into the following categories:
- Data Quality and Integrity,
- Model Validation,
- Model Retraining & Model Replacement,
- Model Serving,
- Service Operational Health.
These categories help you to think through monitoring in the context of the whole Machine Learning workflow.
We wrote this checklist specifically about monitoring and not about deployment. Many issues of putting Machine Learning models into production overlap with issues in monitoring, but they both have specific areas you have to address separately.
This list extends your DevOps monitoring checklist with considerations about Machine Learning models. If your MLOps, Data Engineering, and DataOps are separate processes, you may need to consolidate them.
Data Quality & Integrity
Your Machine Learning pipeline relies on the data you ingest into it. To make it successful, you need to maintain the quality and integrity of your data.
You maintain data quality by cleaning the data from common issues and conducting data preparation for Machine Learning. By maintaining data integrity, you ensure that the data is recorded and retrieved consistently and as intended at different points of the Machine Learning workflow.
You monitor for quality and integrity by introducing checks at critical points of the Machine Learning pipeline. Here is a list of phases between which you should monitor the state of your data:
- Data collection,
- Data preparation,
- Feature engineering,
- Model validation and testing.
There are many possible sources of bad data. Here are the most common examples:
- Data transformations within your workflow;
- New data from your existing source;
- Upstream changes in your current data sources (e.g., in your database or data engineering pipeline).
Data can be ‘bad’ in multiple ways, depending on your use case and demands. You probably already cover most data-related issues during your data cleaning phase. You should definitely check for the following problems:
- Data loss,
- Missing values,
Ask these questions when you are thinking about monitoring data quality and integrity:
- Checkpoints: What are the main phases of your data transformation workflow, and do you have checks implemented between them?
- Sources: What are possible data corruption sources, and how do you recognize their occurrence?
- Types: What are the possible data corruption types, and how do you recognize them?
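A checkpoint between two pipeline phases can be as small as a function that compares what went into a phase with what came out of it. The sketch below is a minimal illustration that covers only the two problems named above (data loss and missing values). It assumes records are plain dictionaries; the function name `check_batch`, the field names, and the thresholds are hypothetical and would need to be adapted to your own pipeline.

```python
from typing import Any


def check_batch(
    rows_in: int,
    rows_out: list[dict[str, Any]],
    required_fields: list[str],
    max_row_loss: float = 0.01,      # assumed tolerance, not a recommendation
    max_missing_rate: float = 0.05,  # assumed tolerance, not a recommendation
) -> list[str]:
    """Return a list of human-readable data quality issues (empty if none)."""
    issues = []

    # Data loss: compare the row count before and after a pipeline phase.
    if rows_in > 0:
        loss = (rows_in - len(rows_out)) / rows_in
        if loss > max_row_loss:
            issues.append(f"data loss: {loss:.1%} of rows dropped")

    # Missing values: share of rows where a required field is absent or None.
    for field in required_fields:
        missing = sum(1 for row in rows_out if row.get(field) is None)
        rate = missing / len(rows_out) if rows_out else 1.0
        if rate > max_missing_rate:
            issues.append(f"missing values: {field} is empty in {rate:.1%} of rows")

    return issues


# Example: 4 rows entered the phase, 3 came out, and one of them lacks "age".
batch = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "FR"},
    {"age": 51, "country": "US"},
]
for issue in check_batch(rows_in=4, rows_out=batch, required_fields=["age", "country"]):
    print(issue)
```

Returning a list of issues instead of raising on the first one lets you log every problem found at a checkpoint and decide separately which ones should stop the pipeline.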
Model Validation
Model validation is a core element of Machine Learning monitoring. Its purpose is to verify whether your models perform well and do what you intend them to do.
Validating models introduces additional complexity compared to code tests and even to data validation because of how Machine Learning training works and because model outputs are non-deterministic.
The primary area of model validation is model performance monitoring, that is, checking whether your models’ outputs align with your business goals. Model validation subcategories cover further issues like data and concept drift, segment-level performance, and bias.
Machine Learning Model Performance Monitoring
Your primary model performance monitoring aim is to see whether your model continues to produce good results based on your chosen Machine Learning model monitoring metrics. You monitor this by looking at changes in the metric’s value over time and setting alerts whenever the change passes a particular threshold. However, there are additional aspects you need to consider besides this direct performance stability check.
Here is a list of questions to ask when planning your Machine Learning model performance monitoring framework:
- Suitability: Are you using the right metric for the specific modeling problem and business case?
- Coverage: Do you have meaningful access to true values for all your target labels? If you do not, can you use leading indicators?
- Consistency: Is your model’s performance consistent across its different uses (e.g., experimentation vs. production, validation vs. testing, between retraining sessions)?
- Adversity: Are there adversarial mechanisms in play you need to monitor and protect against?
- Acceptability: What model performance do you consider ‘good’, and what degree and type of deviation can you accept?
- Value: What is the cost of maintaining/improving model performance? How much value does it generate? Is it worth improving it?
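The direct stability check described above (watch a metric over time and alert when the change passes a threshold) can be sketched in a few lines. This is a minimal illustration, not a complete alerting system: the function name `find_metric_alerts`, the weekly accuracy values, the baseline, and the allowed drop are all made-up assumptions.

```python
def find_metric_alerts(
    history: list[tuple[str, float]],
    baseline: float,
    max_drop: float = 0.05,  # assumed acceptable deviation; choose your own
) -> list[tuple[str, float]]:
    """Return the (period, value) pairs whose metric fell more than
    `max_drop` below the baseline value."""
    return [
        (period, value)
        for period, value in history
        if baseline - value > max_drop
    ]


# Hypothetical weekly accuracy values; the baseline would come from validation.
weekly_accuracy = [
    ("2024-W01", 0.91),
    ("2024-W02", 0.90),
    ("2024-W03", 0.84),
    ("2024-W04", 0.80),
]
alerts = find_metric_alerts(weekly_accuracy, baseline=0.92, max_drop=0.05)
print(alerts)
```

The `baseline` and `max_drop` parameters are where the Acceptability question becomes concrete: you have to decide what ‘good’ means and how much deviation you tolerate before anyone gets paged.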
Data Drift and Concept Drift
In model validation, we can distinguish between the following two types of model drift:
- Data drift: The incoming data’s statistical attributes change over time.
- Concept drift: The predictive relationship changes between your input features and the target label.
Either case can severely degrade your model’s performance, so by catching them, you can prevent performance loss.
Identifying drift is not easy. It requires deep knowledge of your problem, the models you use, and your modeling workflow. For example, a shifting mean can be an issue you need to address or the sign of a trend or seasonality your time series model already expects.
When you think about monitoring model drift, you may ask the following questions:
- What are the expected statistical attributes of your features, and of what changes should you be aware? How do you identify them?
- How do you monitor the relationship between your input features and the target labels? What is an acceptable change?
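One possible way to quantify a change in a feature’s statistical attributes is to compare a reference sample (for example, the training data) with a recent production sample. The sketch below uses the two-sample Kolmogorov-Smirnov statistic as an illustrative choice; it is one technique among several, and the function name `ks_statistic` and the sample values are assumptions. As noted above, a large value is not automatically a problem, because it may reflect a trend or seasonality your model already expects.

```python
from bisect import bisect_right


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical cumulative distribution functions of the two samples.
    Returns a value between 0.0 (identical) and 1.0 (no overlap)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


# Hypothetical feature values seen during training vs. last week in production.
training_values = [1.0, 2.0, 3.0, 4.0]
production_values = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(training_values, production_values))
```

This check only looks at one input feature at a time, so it addresses data drift. Concept drift concerns the relationship between features and labels, which you can only observe once true labels (or leading indicators) become available.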
Segments and Bias
It is not always useful to measure performance on an aggregate level, on your whole dataset, and for all your labels. Often, it is more meaningful to focus on specific segments within your dataset (e.g., returning customers, specific geographies, etc.). To avoid bias and maintain fairness, you also need to identify demographics whose characteristics your model may misinterpret.
For these reasons, you may want to introduce segment-level performance checks and measures to prevent bias in your model. Think through these questions:
- What segments are critical for your business, and how should you monitor your model’s performance on them?
- Are there any demographics your model might be biased about? How can you recognize and prevent bias in time?
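A segment-level check can reuse the same metric you track in aggregate and simply group it by segment. The following is a minimal sketch assuming each record carries a segment name, the true label, and the predicted label; the function name `accuracy_by_segment` and the segment names are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Iterable


def accuracy_by_segment(
    records: Iterable[tuple[str, Hashable, Hashable]],
) -> dict[str, float]:
    """Compute accuracy per segment from (segment, y_true, y_pred) records."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for segment, y_true, y_pred in records:
        total[segment] += 1
        correct[segment] += int(y_true == y_pred)
    return {segment: correct[segment] / total[segment] for segment in total}


# Hypothetical predictions: the aggregate accuracy (5 of 6) hides the fact
# that the model is right only half of the time for new customers.
records = [
    ("returning", 1, 1),
    ("returning", 0, 0),
    ("returning", 1, 1),
    ("returning", 0, 0),
    ("new", 1, 0),
    ("new", 0, 0),
]
print(accuracy_by_segment(records))
```

The same grouping works for demographic attributes, where a large gap between groups is a signal to investigate possible bias.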
Model Retraining & Model Replacement
Deploying a Machine Learning model into production does not mean the end of its lifecycle.
As your model produces predictions on new data, you can feed these results back into your model and retrain it to maintain or increase its performance.
Another use of this mechanism is to run competing models in parallel with the one in production, feed them the same input data, and see whether they produce better results.
If one of the competing models starts to outperform the original, you can move it into production using A/B testing or shadow deployment.
To review these cases, ask these questions:
- Can you use new labels to retrain your model in a meaningful timeframe?
- Can you use the production models for model evaluation, or are they too expensive to rerun for every tiny bit? Can you use a proxy model instead or verify only on a subset of the data or the pipeline?
- Is there a danger that automatically feeding data back into your models will generate worse performance or lead to harmful feedback loops?
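Running a competing model next to the production model on the same inputs can be sketched as follows. This is a simplified illustration under stated assumptions: models are plain Python callables, labels are available for every input, and accuracy is the comparison metric. The names `shadow_compare`, `champion`, and `challenger` are hypothetical.

```python
from typing import Callable, Sequence

Model = Callable[[float], int]


def shadow_compare(
    champion: Model,
    challenger: Model,
    inputs: Sequence[float],
    labels: Sequence[int],
) -> dict[str, float]:
    """Score both models on the same inputs and return their accuracies.
    In a shadow setup, only the champion's predictions would be served."""
    if not inputs or len(inputs) != len(labels):
        raise ValueError("inputs and labels must be non-empty and equally long")
    champion_hits = 0
    challenger_hits = 0
    for x, y in zip(inputs, labels):
        champion_hits += int(champion(x) == y)
        challenger_hits += int(challenger(x) == y)
    n = len(labels)
    return {"champion": champion_hits / n, "challenger": challenger_hits / n}


# Toy threshold classifiers standing in for real models.
def champion(x: float) -> int:
    return int(x > 0.5)


def challenger(x: float) -> int:
    return int(x > 0.3)


scores = shadow_compare(champion, challenger, [0.1, 0.4, 0.6, 0.9], [0, 1, 1, 1])
print(scores)
```

If rerunning the production model for every record is too expensive, the same comparison can be run on a sampled subset of the inputs.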
Model Serving
Because of the Continuous Training mechanism, the model serving predictions in production may run in parallel with the model being retrained. When you monitor your model’s health, you need to be aware of serving-specific issues.
Questions to ask about model serving:
- What is the expected traffic the model requires to make predictions? How do you measure this?
- What is the accepted latency of your models to produce predictions? What are common sources of increased latency, and how can you monitor them?
- When will your training and serving models deviate from each other? How can you test their consistency?
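To answer the latency question, you first need numbers to compare against your accepted latency. The sketch below times individual prediction calls and summarizes them with the median and the 95th percentile. It is a minimal illustration that assumes the model is a Python callable invoked in-process; the names `measure_latencies` and `latency_report` and the latency budget are hypothetical, and a real service would usually collect these values from its serving infrastructure instead.

```python
import statistics
import time
from typing import Any, Callable, Iterable


def measure_latencies(predict: Callable[[Any], Any], inputs: Iterable[Any]) -> list[float]:
    """Time each prediction call and return the latencies in seconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append(time.perf_counter() - start)
    return latencies


def latency_report(latencies_s: list[float], budget_s: float) -> dict[str, Any]:
    """Summarize latencies and compare the 95th percentile with a budget.
    Requires at least two measurements."""
    # statistics.quantiles with n=100 returns 99 cut points; index 94 is p95.
    p95 = statistics.quantiles(latencies_s, n=100)[94]
    return {
        "p50": statistics.median(latencies_s),
        "p95": p95,
        "within_budget": p95 <= budget_s,
    }


# Toy model and an assumed budget of 50 ms per prediction.
samples = measure_latencies(lambda x: x * 2, range(200))
print(latency_report(samples, budget_s=0.05))
```

Reporting a high percentile alongside the median matters because a small share of slow requests can violate your accepted latency even when the typical request is fast.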
Service Operational Health
Machine Learning models require operational health monitoring just as regular software projects do, with some special twists.
You can ask the following questions to check for Machine Learning-specific service health issues:
- What are the expected and acceptable values for your Machine Learning service health indicators (e.g., API endpoint latency, memory and CPU usage of retraining and predictions, disk utilization, costs)?
- Do you have the staff on call with the right skill set to address emergency events?
Verify Your Machine Learning Monitoring Framework!
The motivation for Machine Learning pipeline monitoring is to gather information about how the ML pipeline behaves in real time. With this capability, teams can monitor the infrastructure, validate the data, and track the training process and the value of the model in the real world. Tools that give practitioners feedback about the general health of their models make model monitoring easier. This feedback helps organizations measure specific metrics and decide whether to improve the model in production, based on the feedback loops a monitoring tool provides.
A tool is only as valuable as what the team wants to achieve with it, so these monitoring tools are used as part of a model monitoring framework that tracks and reports the metrics valuable to the business. This means that every team needs to create its own framework, covering everything from the metrics used for alerts and the actions they trigger down to troubleshooting. Such a framework promotes effective maintenance of your model in production.
Monitoring ideally happens on two levels:
- Functional monitoring level
- Operational monitoring level
On a functional level, you can choose stability or performance metrics as part of your framework. For example, stability metrics might include the Population Stability Index (PSI), which can be used to monitor data drift and model performance. Performance metrics can range from accuracy and AUC-ROC to R-squared or other well-known metrics used to detect concept drift. Remember that the use case can also influence the choice of metrics.
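To make the PSI concrete: it bins a reference sample and a current sample in the same way and sums (actual share - expected share) * ln(actual share / expected share) over the bins. The sketch below is a minimal illustration, assuming equal-width bins over the reference range and a small `eps` to avoid taking the logarithm of zero for empty bins. The bin count, `eps`, and the function name are assumptions; other binning schemes (such as quantile bins) are also common.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample (`expected`) and a current sample
    (`actual`), using equal-width bins over the reference range."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant sample

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            index = int((v - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: the production sample is shifted upwards.
reference = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in reference]
print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

A commonly cited rule of thumb treats values below roughly 0.1 as stable and values above roughly 0.25 as a significant shift, but these cut-offs are conventions, not guarantees, and you should calibrate them for your own use case.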
Image credit: Pronojit Saha
At the operational level, operations metrics can be CPU/GPU utilization, API calls, system uptime, etc., depending on the value of this measurement to the overall goal of the organization.
Verify your monitoring framework by evaluating the value that the tool, the metrics, and your team’s actions bring to the business, both functionally and operationally.
This article has reviewed the most common areas you should monitor in your Machine Learning workflow. This list is not exhaustive as there can be further issues depending on your project and situation. You may need to monitor security threats, privacy issues, or domain-specific ways to collect, handle, and interpret data. Go through this checklist and spend time thinking about how you will implement it in your workflow.
If you want to be sure that you cover all possible edge cases and stress situations, use monitoring services like Deepchecks. With our model validation solution, you do not have to constantly keep in mind all the whats and hows of monitoring as you get them built-in, tried out, and updated with state-of-the-art Machine Learning research.