Introduction
Machine Learning models have become part of almost all major business processes, so much so that monitoring ML models is now a critical function in most large organizations. Without active monitoring, a deployed Machine Learning model can quietly become stale, and the critical processes it supports can suffer potentially catastrophic consequences for the overall project. Here, we will discuss generic considerations when shortlisting from the wide range of available tools, and how to make the best decision for our project using a variety of open-source and paid tools for model monitoring.
A proper Machine Learning product development lifecycle will almost always include an ML monitoring platform; an example is depicted in the image below.
Why monitor a Machine Learning Model?
It is a basic expectation of any software component that it give consistent, or at least predictable, results. It is difficult to have the same expectation of Machine Learning, since the reproducibility of model results is subject to factors that are not always within the control of a Machine Learning practitioner. The distribution of the incoming features can drift significantly over time, which may render the current ML model obsolete. Small adjustments at regular intervals usually have a positive influence on a model's results. Since training a Machine Learning model can be an expensive operation (we may need humans in the loop to annotate datasets, along with the infrastructure to train large models), we need an ML model monitoring framework to time these interventions well. A simple ML model monitoring metric for triggering such interventions is the F1-score; a minimal sketch of such a check is shown below. The illustration further below outlines a model lifecycle with regular interventions that keep performance above a certain threshold.
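The snippet below is a minimal sketch of this kind of threshold-based intervention, assuming we periodically collect ground-truth labels for a sample of recent predictions; the threshold value and the retraining hook mentioned in the comments are assumptions for illustration.

```python
# A minimal sketch of threshold-based monitoring on F1-score.
# Assumes ground-truth labels become available for a window of recent
# predictions; what to do on failure (alert, retrain) is tool-specific.
from sklearn.metrics import f1_score

F1_THRESHOLD = 0.80  # assumed business threshold

def check_model_health(y_true, y_pred) -> bool:
    """Compute F1 on the latest labelled window and flag if it drops."""
    current_f1 = f1_score(y_true, y_pred, average="macro")
    if current_f1 < F1_THRESHOLD:
        # Intervene: raise an alert and/or schedule retraining.
        print(f"F1 dropped to {current_f1:.3f}; intervention needed.")
        return False
    print(f"F1 is healthy at {current_f1:.3f}.")
    return True
```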
Specific Components to Consider When Shortlisting
Now that we have established why we need ML model performance monitoring, another important consideration for selecting the right monitoring tool is listing out the components that we need to monitor as a “health check” for our inference pipeline. The following are some of the production challenges we need to look out for:
– Explainability for Machine Learning Models
In some implementations, it might be critical not only to get inference results, but also to explain a model’s prediction to ensure that its motivation is in line with the business objective. This is achieved by allowing the relevant stakeholders to assess and validate the explanations through an appropriate ML monitoring platform; a sketch of how per-prediction explanations can be produced is shown below.
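As one common approach, per-prediction feature attributions can be generated with SHAP and then logged alongside each prediction. This is a minimal sketch assuming a tree-based model (e.g., a scikit-learn random forest); how the attributions are pushed to the monitoring platform is tool-specific and omitted here.

```python
# A minimal sketch of generating per-prediction explanations with SHAP
# for a tree-based model; forwarding the attributions to the monitoring
# platform is left out as it depends on the chosen tool.
import shap

def explain_prediction(model, X_row):
    """Return per-feature SHAP contributions for a single inference input."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_row)
    # Stakeholders can review these attributions to confirm the prediction
    # is driven by features that make business sense.
    return shap_values
```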
– Model Readiness/Relevance for Deployment
More often than not, we will be making changes to the deployed model version, either to improve the model’s performance metric or to reduce infrastructure cost by making the associated components more efficient. Either way, it will be impossible to determine whether such changes are having a positive impact on the overall objective if we are not monitoring the results; a simple promotion check is sketched below. The chart below shows how Machine Learning models transition from the development to the production environment, using the monitoring tool to tune the results toward an optimum:
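The following is a minimal sketch of such a readiness gate: a new model version is promoted only if it improves on the monitored metric of the current version. The metric dictionaries, metric name, and minimum gain are assumptions for illustration; in practice these values would be pulled from the monitoring tool.

```python
# A minimal sketch of a deployment-readiness gate: promote a challenger
# model only if it beats the current champion on the monitored metric.
def ready_for_promotion(champion_metrics: dict, challenger_metrics: dict,
                        metric: str = "f1", min_gain: float = 0.01) -> bool:
    """Return True if the challenger improves the metric by at least min_gain."""
    return challenger_metrics[metric] >= champion_metrics[metric] + min_gain

# Example usage with metrics read from the monitoring dashboard:
# ready_for_promotion({"f1": 0.82}, {"f1": 0.85})  -> True
```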
– Data Consistency Across Releases
We need to ensure that, across different releases in which we may have made multiple changes to the pre-processing pipelines, the training and inference data processing pipelines remain consistent. The absence of such checks might cause issues with the predictions of the downstream models without any notable difference in the data distribution of the incoming features. A simple consistency check is sketched below.
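This is a minimal sketch of one way to compare the outputs of the training-time and inference-time preprocessing pipelines: check that the schemas match and that basic statistics of shared numeric columns stay within a tolerance. The column handling and the 10% tolerance are assumptions for illustration.

```python
# A minimal sketch of a consistency check between training and inference
# preprocessing outputs: compare schemas and simple column statistics.
import pandas as pd

def check_pipeline_consistency(train_df: pd.DataFrame,
                               inference_df: pd.DataFrame,
                               rel_tol: float = 0.10) -> list:
    issues = []
    # 1. Schema check: both pipelines should emit the same columns.
    if set(train_df.columns) != set(inference_df.columns):
        issues.append("column mismatch between training and inference data")
    # 2. Simple statistics check: mean of each shared numeric column.
    shared = train_df.columns.intersection(inference_df.columns)
    for col in train_df[shared].select_dtypes("number").columns:
        train_mean, inf_mean = train_df[col].mean(), inference_df[col].mean()
        if train_mean and abs(inf_mean - train_mean) / abs(train_mean) > rel_tol:
            issues.append(f"mean of '{col}' shifted beyond {rel_tol:.0%}")
    return issues
```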
– Monitoring Perturbations in Class Prediction
Machine Learning models are as susceptible to malicious manipulation as any other software (e.g., introducing a small amount of noise into a computer vision input can skew the predictions). Such vulnerabilities can have potentially catastrophic implications for Machine Learning models in production. We can track the predicted classes for incoming requests to check whether the distribution of predicted classes for specific users or regions remains consistent and is not being skewed heavily in favor of a specific class; a sketch of such a check follows.
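As a minimal sketch, the recent predicted-class counts can be compared against a baseline class mix with a chi-square test. The baseline proportions, window of recent predictions, and p-value threshold are assumptions for illustration.

```python
# A minimal sketch of monitoring the predicted-class distribution:
# compare recent prediction counts against an expected baseline mix.
from collections import Counter
from scipy.stats import chisquare

def class_distribution_alert(recent_preds, baseline_freqs, p_threshold=0.01):
    """Flag if recent predictions deviate significantly from the baseline mix.

    baseline_freqs: dict mapping class label -> expected proportion (sums to 1).
    """
    counts = Counter(recent_preds)
    classes = sorted(baseline_freqs)
    observed = [counts.get(c, 0) for c in classes]
    total = sum(observed)
    expected = [baseline_freqs[c] * total for c in classes]
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    return p_value < p_threshold  # True means the class mix looks skewed
```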
Available Tools for Model Monitoring
A number of tools are available for model monitoring. Here is a list of some of the available options along with their salient features.
Deepchecks
Deepchecks allows for monitoring across multiple types of checks (a minimal usage sketch follows the list below):
- Distribution Checks
- Performance Checks
- Model Explainability Checks
- Data Integrity Checks
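The snippet below is a minimal sketch of running Deepchecks’ data integrity suite on a tabular dataset. Exact module paths can differ between deepchecks versions, and the CSV file and label column are placeholders for illustration.

```python
# A minimal sketch of Deepchecks' data-integrity suite on tabular data;
# module names may vary slightly across deepchecks versions.
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity

df = pd.read_csv("inference_sample.csv")           # hypothetical data sample
ds = Dataset(df, label="target", cat_features=[])  # declare the label column

result = data_integrity().run(ds)
result.save_as_html("data_integrity_report.html")  # shareable HTML report
```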
Neptune
A metadata store with built-in features to support MLOps for both research and production cycles, even across a significantly high number of experiments. Some of the significant features are (a minimal logging sketch follows the list):
- Hardware metric display
- Addition of custom metrics
- Ability to create custom dashboards
- Metric comparison across deployments
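Below is a minimal sketch of logging custom monitoring metrics to Neptune, assuming the neptune Python client (1.x API); the project name, API token, and metric names are placeholders for illustration.

```python
# A minimal sketch of pushing custom production metrics to Neptune.
import neptune

run = neptune.init_run(project="my-workspace/model-monitoring",
                       api_token="YOUR_API_TOKEN")  # hypothetical credentials

run["deployment/version"] = "v2.3"
for f1 in [0.86, 0.85, 0.83]:         # e.g. daily F1 values from production
    run["production/f1"].append(f1)   # appears as a chart on the dashboard

run.stop()
```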
Mona
It is a contextual monitoring and AI analytics system capable of monitoring a wide range of implementations, with configurations fine-tuned for domain-specific use cases such as:
- Machine learning model monitoring
- NLU/NLP including chatbots
- Intelligent automation
- Computer vision
- Speech/audio
Grafana + Prometheus
These are general-purpose tools not built specifically for Machine Learning, but their customizability makes them formidable for monitoring hardware performance and throughput (a sketch of exposing model metrics to Prometheus follows the list). Notable features include:
- Alerts for specific monitoring metrics.
- Automation using Grafana scripts.
- Annotation for collaboration across different teams.
- Dashboard templates to ensure consistency and propagate interpretability.
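As a minimal sketch, a Python service can expose model metrics on a `/metrics` endpoint with the prometheus_client library; Prometheus scrapes it and Grafana charts and alerts on the stored series. The metric names, port, and the randomly generated value are assumptions for illustration.

```python
# A minimal sketch of exposing model metrics for Prometheus scraping,
# so Grafana dashboards and alert rules can be built on top of them.
import time
import random
from prometheus_client import Gauge, Counter, start_http_server

F1_GAUGE = Gauge("model_f1_score", "Rolling F1-score of the deployed model")
REQUESTS = Counter("model_requests_total", "Number of inference requests served")

start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics

while True:
    REQUESTS.inc()
    F1_GAUGE.set(0.8 + random.uniform(-0.05, 0.05))  # stand-in for a real metric
    time.sleep(60)
```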
How to Choose the Right Tool
In our discussion above, we made observations as to the most critical considerations to keep in mind when deciding on a monitoring system for our Machine Learning pipelines. We also covered some of the popular Machine Learning monitoring tools currently available in the community for various implementations. The specific choice of tool will depend heavily on one or more of the following factors:
– Area of Implementation
Working with computer vision is different from working with NLP or NLU-based tasks. We will be interested in monitoring different components, with only some overlap, when working with different domains within Machine Learning. The availability of different sets of metrics, and the ease of adding custom metrics to a tool, is one of the most important factors in this decision.
– Tool License Consideration
In light of our discussion above, we can see that there are both open-source and paid tools available in the market. The type of license varies from tool to tool, making it an important consideration when choosing a monitoring tool.
– Choice of Monitoring Metrics/Components
The availability of metrics like F1-score or drift out of the box is also among the most important considerations when deciding on a monitoring tool, since it makes the integration process easier and more seamless; if a metric is missing, it should at least be easy to add it as a custom metric, as sketched below.
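For instance, a drift metric such as the Population Stability Index (PSI) is simple enough to implement as a custom metric when a tool does not ship it. This is a minimal sketch; the bin count and the 0.2 threshold are common rule-of-thumb values, not prescriptions.

```python
# A minimal sketch of a custom drift metric (Population Stability Index)
# computed between a baseline feature sample and its current distribution.
import numpy as np

def psi(baseline, current, bins=10) -> float:
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)  # avoid log(0) / division by zero
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Rule of thumb: PSI > 0.2 usually signals significant drift.
```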
– Frequency of Updates
The frequency of updates is important in a Machine Learning project where the throughput is high and we are expecting an alert in near-real time if something goes wrong. A good example of this can be a tool being used for monitoring a Machine Learning model in critical care rooms in a hospital.
– Ability to Create Alerts
An often overlooked aspect of any monitoring tool, this capability lets us define rules (such as a metric dropping below a threshold) that trigger alerts automatically.
Conclusion
Choosing a monitoring tool may be specific to the implementation task at hand, but we should deliberate on the generic considerations to prepare a checklist of must-haves when shortlisting a tool. We can then look at the niche abilities needed for a successful monitoring tool in our Machine Learning project. There are some great open-source implementations available that allow for easy customization and provide pre-built solutions that, taken together, make for great monitoring dashboards.