As companies deploy more and more ML models to production, stakeholders are beginning to understand the importance of tracking and monitoring these models to ensure that they perform as expected. While directing resources toward proper deployment and monitoring of machine learning models in production can be costly, an undetected dysfunctional model, or a bug that takes a long time to locate and fix, can cost your company far more. In this post we will discuss some best practices for monitoring ML models in production, which may save you a great deal of headaches and spare valuable resources further down the road.
Monitor Your Data
An ML model’s performance is intertwined with the quality of the data it’s fed. If the data distribution remains similar to the training data, your model will likely continue to perform as expected. On the other hand, if there is significant data drift, your model will probably not be equipped for the task. By monitoring the input data at its different stages, and detecting data drift or data integrity issues, you’ll be able to prevent potential catastrophes early on.
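As a concrete illustration, here is a minimal sketch of one common drift metric, the Population Stability Index (PSI), computed with NumPy. The feature arrays, the synthetic shift, and the 0.1 alert threshold are all illustrative assumptions, not part of any particular tool:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    # Bin edges come from the reference (training) distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
live_feature_ok = rng.normal(loc=0.0, scale=1.0, size=10_000)
live_feature_drifted = rng.normal(loc=0.5, scale=1.0, size=10_000)

# A common rule of thumb: PSI above ~0.1 warrants investigation
print(psi(training_feature, live_feature_ok))       # small, no alarm
print(psi(training_feature, live_feature_drifted))  # large, data drift detected
```

Running this check per feature on each batch of production data, and alerting when the index crosses your threshold, is one simple way to catch drift before it degrades predictions.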
Fine Grained Monitoring
When it comes to machine learning model performance monitoring, the rule is “the more the merrier”. If, for example, you only monitor the model’s overall accuracy over time, you are likely to detect potential issues late, and furthermore, you will have no clue as to the reason for your model’s failure. To gain more meaningful insights, we suggest monitoring the performance of important data slices (e.g. VIP users, sensitive attributes) as well as per-class performance as a start.
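A minimal sketch of slice-level monitoring with pandas; the `segment` column, labels, and values below are hypothetical stand-ins for a real prediction log:

```python
import pandas as pd

# Hypothetical prediction log: true label, predicted label, and a slice column
log = pd.DataFrame({
    "segment": ["vip", "vip", "free", "free", "free", "vip"],
    "y_true":  [1, 0, 1, 1, 0, 1],
    "y_pred":  [1, 0, 0, 1, 1, 1],
})

# Overall accuracy hides where the model fails
overall = (log.y_true == log.y_pred).mean()

# Per-slice accuracy surfaces degradation on important segments
per_slice = (log.assign(correct=log.y_true == log.y_pred)
                .groupby("segment")["correct"].mean())

print(overall)    # 4/6 overall...
print(per_slice)  # ...but "vip" is perfect while "free" is at 1/3
```

The same groupby pattern works for per-class metrics: group by the true label instead of the segment column.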
Log your predictions and the model’s actions. This way, you’ll be able to trace individual predictions and analyze the causes for your model’s errors. Additionally, displaying warning and error messages in a way that’s accessible and visible is a must. All too often, models are deployed and then fail silently when the data structure changes abruptly.
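One lightweight way to make individual predictions traceable is to emit a structured record per call. This sketch uses only the Python standard library; the field names and the `churn-v3` version tag are hypothetical:

```python
import datetime
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_predictions")

def log_prediction(features, prediction, model_version):
    """Emit one structured, traceable record per prediction."""
    record = {
        "prediction_id": str(uuid.uuid4()),  # lets you trace a single call later
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,      # ties the output to a model build
        "features": features,
        "prediction": prediction,
    }
    logger.info(json.dumps(record))
    return record

record = log_prediction({"age": 42, "income": 55_000}, 0.87, "churn-v3")
```

Because each record carries the model version and the exact input, you can later reproduce any suspicious prediction instead of guessing what the model saw.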
It is quite common for ground truth labels to become available only long after the predictions are made. For example, when a bank must assess whether an individual is likely to pay off their mortgage, it may take 20 years for the label to become available. How, then, do we monitor the model’s success?
- When new labels become available, use them to test your model’s performance. If there is significant degradation, it may be time to retrain your model.
- Compare your model’s predictions with a baseline model that uses simple reliable logic. If your model drifts far away from what you believe to be a fairly good baseline, there might be a problem.
- Use human evaluators with domain knowledge to assess the model’s predictions. Would they generate different predictions? If so, try to understand what caused your model to produce an “error”. Using libraries like SHAP or ELI5 can be helpful here.
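The baseline-comparison idea above can be sketched as follows. The income-threshold rule, the model outputs, and the 80% alerting threshold are all hypothetical assumptions:

```python
import numpy as np

def agreement_rate(model_preds, baseline_preds):
    """Share of cases where the deployed model and a simple baseline agree."""
    model_preds = np.asarray(model_preds)
    baseline_preds = np.asarray(baseline_preds)
    return float((model_preds == baseline_preds).mean())

# Baseline: approve anyone with income above a fixed threshold (hypothetical rule)
incomes = np.array([30_000, 80_000, 120_000, 45_000, 95_000])
baseline = (incomes > 60_000).astype(int)

model = np.array([1, 1, 1, 1, 1])  # hypothetical deployed-model outputs

rate = agreement_rate(model, baseline)
if rate < 0.8:  # hypothetical alerting threshold
    print(f"warning: model agrees with baseline on only {rate:.0%} of cases")
```

No labels are needed here, which is exactly why this check is useful while you wait for ground truth to arrive.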
Collaboration Across Disciplines
Monitoring ML models requires collaboration across disciplines
Proper monitoring of ML models requires expertise and fluency in both data science and DevOps. This should be a team effort, with all parties developing a common language. In some scenarios, the ML engineer or a designated MLOps team can function as a bridge between the disciplines.
Can You Do Better?
As a preventative measure, we recommend training new models automatically as new data becomes available. This way, you will be able to identify an opportunity to improve your model dramatically even before a degradation trigger is activated. This is especially relevant for models which have been trained on small datasets or in cases where there is reason to suspect that there is significant data drift or concept drift.
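A champion/challenger check along these lines might look like the sketch below, using scikit-learn on synthetic data; the dataset, the split, and the 1-point promotion margin are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical setup: a deployed "champion" and a "challenger" retrained on fresh data
X, y = make_classification(n_samples=2000, random_state=0)
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.5, random_state=0)
X_train_new, X_holdout, y_train_new, y_holdout = train_test_split(
    X_new, y_new, test_size=0.3, random_state=0)

champion = LogisticRegression(max_iter=1000).fit(X_old, y_old)
challenger = LogisticRegression(max_iter=1000).fit(X_train_new, y_train_new)

# Evaluate both on the same fresh holdout set
champ_acc = accuracy_score(y_holdout, champion.predict(X_holdout))
chall_acc = accuracy_score(y_holdout, challenger.predict(X_holdout))

# Promote only on a meaningful improvement (hypothetical 1-point margin)
if chall_acc > champ_acc + 0.01:
    print(f"promote challenger: {chall_acc:.3f} vs {champ_acc:.3f}")
```

Scheduling this comparison whenever enough new labeled data accumulates lets you capture improvements proactively rather than waiting for a degradation alert.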
Use A Unified Solution
It is not uncommon for data science teams to try to reinvent the wheel when it comes to monitoring. Thus a single company may have multiple scripts that are in charge of monitoring different ML models. This bad practice can be quite costly in the long run, and it’s highly inefficient. Using a single solution, whether developed in-house or a third party service, will help you reduce the hidden technical debt in the long run.
Monitoring Your KPIs
When we talk about cost optimization, it is important to be able to quantify how much value an additional accuracy point adds. It is therefore important to monitor metrics that provide business stakeholders with insights. This way, an informed decision can be made about whether it is worth putting more effort into improving existing models or moving on to other problems.
Integration With Development Environment
Proper integration of the monitoring system with a development environment can cut down debugging time significantly. Imagine detecting a strange prediction in production, and then having access to a debugging window with the same input after processing and the loaded model at a single click. The ease of this process is what ensures that insights from the monitoring stage are then used to develop and improve the next model.
Proper integration will help you incorporate insights from the monitoring phase into the next model
Last but not least, understanding the resources your model requires to perform seamlessly can help you cut significant costs. Metrics specific to ML models include GPU utilization and prediction time, which affects latency. Monitoring these alongside traditional traffic analytics can help you optimize the resources your model uses in production.
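A minimal way to track prediction latency is to wrap the model call with a timer and watch a high percentile; `dummy_model` and the loop below are stand-ins for a real serving path:

```python
import statistics
import time

latencies_ms = []

def timed_predict(model_fn, x):
    """Wrap a prediction call and record its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = model_fn(x)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return result

# Hypothetical stand-in for a real model
def dummy_model(x):
    return sum(x) > 1.0

for _ in range(100):
    timed_predict(dummy_model, [0.4, 0.7])

# Tail latency matters more than the average for user-facing services
p95 = statistics.quantiles(latencies_ms, n=20)[-1]  # 95th percentile
print(f"p95 latency: {p95:.3f} ms")
```

In a real deployment you would export these measurements to your metrics system rather than keep them in a list, but the timing pattern is the same.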
To sum it up, we have shared some of the insights we developed at Deepchecks regarding best practices for machine learning model monitoring. We firmly believe that a strong framework for deploying and monitoring machine learning models will let you get the most bang for your buck and devote more resources to researching new problems rather than debugging old ones. Feel free to reach out to us with questions or comments!