
ML Model Monitoring: Best Practices for Performance and Cost Optimization

Introduction

As companies deploy more and more Machine Learning (ML) models to production, stakeholders are beginning to understand the importance of tracking and monitoring these models to ensure they are performing as expected. While directing resources toward proper deployment and monitoring of Machine Learning models in production may be costly, an undetected dysfunctional model, or a bug that takes a long time to locate and fix, is a much greater cost for your company. In this post, we will discuss some of the best practices for monitoring ML models in production that can save you plenty of headaches and free up important resources down the road.

Monitor Your Data

An ML model’s performance is intertwined with the quality of the data it is fed. If the data distribution remains similar to the training data, your model will likely continue to perform as expected. If there is significant data drift, your model will probably no longer be equipped for the task. By monitoring the input data at its different stages and detecting data drift or data integrity issues, you’ll be able to prevent catastrophes early on.
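As a minimal sketch of what such a check might look like, the snippet below uses a two-sample Kolmogorov-Smirnov test (one of several possible drift measures) to compare a numeric feature's training distribution against a recent production window; the sample data and threshold are purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def numeric_feature_drifted(train_values, prod_values, alpha=0.05):
    """Flag drift on one numeric feature with a two-sample KS test.

    Returns (drift_detected, ks_statistic). A small p-value means the
    production distribution differs significantly from training.
    """
    result = ks_2samp(train_values, prod_values)
    return result.pvalue < alpha, result.statistic

# Illustrative data: stand-ins for a training sample and a recent
# window of production inputs for the same feature.
train_sample = np.random.normal(loc=0.0, scale=1.0, size=5_000)
prod_window = np.random.normal(loc=0.3, scale=1.0, size=1_000)

drifted, stat = numeric_feature_drifted(train_sample, prod_window)
print(f"KS statistic={stat:.3f}, drift detected={drifted}")
```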

Fine-grained Monitoring

When it comes to ML model performance monitoring, the more the merrier. If you only monitor the model’s overall accuracy over time, potential issues will likely surface later than they should, and you won’t know why your model failed. To gain meaningful insights, we suggest starting by monitoring the performance of important data slices (e.g., VIP users, sensitive attributes, etc.) and per-class performance.
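As an illustration (not a prescription), a per-slice report can be as simple as grouping logged predictions by a slice column; the dataframe layout and column names below are assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def slice_performance(df: pd.DataFrame, label_col: str, pred_col: str, slice_col: str) -> pd.Series:
    """Accuracy per data slice, assuming one logged row per prediction
    with the true label, the prediction, and a slice identifier."""
    return (
        df.groupby(slice_col)
          .apply(lambda g: accuracy_score(g[label_col], g[pred_col]))
          .rename("accuracy")
    )

# Hypothetical usage on a table of logged predictions:
# logged = pd.read_parquet("predictions_june.parquet")
# slice_performance(logged, "label", "prediction", "user_segment")   # e.g. VIP vs. regular users
# slice_performance(logged, "label", "prediction", "label")          # per-class performance
```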

Verbose Logging

By logging your predictions and the model’s actions, you’ll be able to trace individual predictions and analyze the causes of your model’s errors. Displaying warning and error messages in a way that’s accessible and visible is also a must. All too often, models fail silently when the data structure changes abruptly.
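A minimal sketch of such logging, assuming an sklearn-style model and Python's standard logging module; the wrapper and field names are illustrative.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("model_serving")
logging.basicConfig(level=logging.INFO)

def predict_and_log(model, features: dict):
    """Wrap a prediction call so every request is traceable afterwards."""
    request_id = str(uuid.uuid4())
    try:
        prediction = model.predict([list(features.values())])[0]
    except Exception:
        # Fail loudly, not silently, when the input structure breaks the model.
        logger.exception(f"Prediction failed for request {request_id}: {features}")
        raise
    logger.info(json.dumps({
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "prediction": str(prediction),
    }))
    return prediction
```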

Continuous Evaluation

It is common for ground-truth labels to become available only long after the predictions are made. For example, when a bank must assess whether an individual is likely to pay off their mortgage, it may take 20 years for that label to become available. How, then, do we monitor the model’s success?

  • When new labels become available, use them to test your model’s performance. If there is significant degradation, it may be time to retrain your model.
  • Compare your model’s predictions with a baseline model that uses simple, reliable logic (see the sketch after this list). If your model drifts far away from what you believe is a fairly good baseline, there might be a problem.
  • Use human evaluators with domain knowledge to assess the model’s predictions. Would they generate different predictions? If so, try to understand what caused your model to produce an “error.” Using libraries like SHAP or ELI5 can be helpful.
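To make the baseline comparison from the list above concrete, here is a rough sketch of a disagreement check; the threshold and the rule-based baseline are assumptions you would tune for your own use case.

```python
import numpy as np

def baseline_disagreement(model_preds, baseline_preds, threshold=0.15):
    """Share of predictions where the production model disagrees with a
    simple rule-based baseline. A sudden jump in disagreement can raise
    an alert long before ground-truth labels become available."""
    model_preds = np.asarray(model_preds)
    baseline_preds = np.asarray(baseline_preds)
    rate = float(np.mean(model_preds != baseline_preds))
    return rate, rate > threshold

# Hypothetical usage: the baseline could be a rule such as
# "approve the mortgage when the debt-to-income ratio is below 0.4".
# rate, alert = baseline_disagreement(model_batch_preds, rule_based_preds)
```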

Collaboration Across Disciplines

Monitoring ML models requires collaboration across disciplines (source)

Proper monitoring of ML models requires expertise and fluency in both data science and DevOps. This process should be a team effort, with all parties developing a common language. In some scenarios, the ML engineer or a designated MLOps team functions as a bridge between the disciplines.


Can You Do Better?

As a preventative measure, we recommend training new models as soon as new data becomes available. You will be able to identify opportunities to improve your model dramatically even before a degradation trigger is activated. This is especially relevant for models that have been trained on small datasets, or in cases where there is reason to suspect significant data drift or concept drift.
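One way this can look in practice, assuming sklearn-style models and an existing labeled holdout set, is a simple challenger-vs-production comparison; the function below is a sketch, not a full retraining pipeline.

```python
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def challenge_production_model(prod_model, X_new, y_new, X_holdout, y_holdout):
    """Train a challenger on newly labeled data and keep whichever model
    scores better on a shared holdout set (binary classification assumed)."""
    challenger = clone(prod_model).fit(X_new, y_new)
    prod_auc = roc_auc_score(y_holdout, prod_model.predict_proba(X_holdout)[:, 1])
    challenger_auc = roc_auc_score(y_holdout, challenger.predict_proba(X_holdout)[:, 1])
    return challenger if challenger_auc > prod_auc else prod_model
```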

Use a Unified Solution

It is common for data science teams to try to “reinvent the wheel” when it comes to monitoring. A single company may have multiple scripts in charge of monitoring different ML models. This bad practice is quite costly and highly inefficient in the long run. Using a single solution, whether developed in-house or by a third-party service, will help you reduce that hidden technical debt.

Monitoring Your KPIs

When we talk about cost optimization, it is important to be able to quantify how much value an additional accuracy point adds. Monitor metrics that give stakeholders insight from the business perspective, so that an informed decision can be made about whether it is worth putting more effort into improving existing models or moving on to other problems.

Integration with Development Environment

Proper integration of the monitoring system with a development environment can cut down debugging time significantly. Imagine detecting a strange prediction in production and then, with a single click, opening a debugging session with the loaded model and the very same processed input. The ease of this process is what ensures that insights from the monitoring stage are used to develop and improve the next model.

Proper integration will help you incorporate insights from the monitoring phase into the next model (source)

Resource Optimization

Last but not least, resource optimization. Understanding the resources your model requires to perform seamlessly can help reduce significant costs. Metrics specific to ML models include GPU utilization and prediction time (which affects latency). Monitoring these metrics along with traditional traffic analytics helps you optimize the resources used for your model in production.
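As a small sketch of the latency side, assuming a Python serving path, prediction time can be recorded with a decorator and summarized into percentiles; in a real deployment you would likely export these to your metrics stack rather than keep them in memory.

```python
import time
import statistics
from functools import wraps

_latencies_ms = []  # in-memory stand-in for a metrics backend

def track_latency(predict_fn):
    """Record wall-clock prediction time for each call."""
    @wraps(predict_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = predict_fn(*args, **kwargs)
        _latencies_ms.append((time.perf_counter() - start) * 1000)
        return result
    return wrapper

def latency_report():
    """p50/p95/p99 prediction latency in milliseconds."""
    qs = statistics.quantiles(_latencies_ms, n=100)
    return {"p50_ms": qs[49], "p95_ms": qs[94], "p99_ms": qs[98]}
```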

Conclusion

We at Deepchecks firmly believe that having a strong framework for deploying and monitoring ML models enables you to get the most bang for your buck and devote more resources to researching new problems rather than debugging old ones. Feel free to reach out to us with questions or comments!

