Machine learning models are becoming increasingly popular across many industries, from fraud detection in the financial sector to customer service chatbots to online advertising. These models sit at the center of large companies’ products, so it can be critical to detect problems as soon as possible.
However, this is no simple task, as ML models are complex objects. If we monitor them like any other piece of software that runs in production, we will most likely miss many subtle issues and only detect them after they have caused a significant loss to our company.
Keep your metrics closer
Imagine you are suffering from some mild pain in your leg. It becomes extremely painful for a moment, and then it passes. You keep telling yourself it’s not so bad, and therefore it never gets fixed. If it were just a little more painful, you would get it checked out right away and would probably end up suffering much less.
A good monitoring system should make you feel the pain even when you doubt it is there. And when it comes to monitoring machine learning models, it is all the more complicated to define and detect when something goes “wrong”.
Machine learning model monitoring vs classic software systems monitoring
Here are some of the major differences between ML monitoring and classic software monitoring:
- Less noticeable – It is more likely that your model will output a valid value that is incorrect. No exception will be thrown, your website’s home page won’t be down, you will just get bad results.
- Harder to test – Running a simple test on a single sample tells us little, since even a good model is expected to be wrong on some inputs. What we want to find out is whether our accuracy over many samples is lower than expected.
- Development-production discrepancy – Development data often does not simulate real-world data precisely enough. Since real-world data is usually not labeled, errors are not easy to detect.
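The "harder to test" point can be made concrete with a small sketch. Instead of asserting on one prediction, we evaluate a batch of labeled samples and ask whether the observed accuracy is significantly below what we expect. The function name `accuracy_is_below_expected`, its parameters, and the choice of a one-sided z-test (normal approximation to the binomial) are assumptions made for illustration, not a prescribed method.

```python
import math


def accuracy_is_below_expected(y_true, y_pred, expected_accuracy, z_threshold=1.645):
    """Return True if batch accuracy is significantly below expected_accuracy.

    Uses a one-sided z-test with the normal approximation to the binomial;
    z_threshold=1.645 corresponds to roughly a 5% significance level.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    if not 0.0 < expected_accuracy < 1.0:
        raise ValueError("expected_accuracy must be strictly between 0 and 1")

    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    observed = correct / n
    # Standard error of the accuracy if the model really had expected_accuracy.
    std_err = math.sqrt(expected_accuracy * (1.0 - expected_accuracy) / n)
    z = (observed - expected_accuracy) / std_err
    return z < -z_threshold


# Hypothetical batch: 100 labeled samples, 70 of them predicted correctly.
labels = [1] * 100
predictions = [1] * 70 + [0] * 30
alert = accuracy_is_below_expected(labels, predictions, expected_accuracy=0.90)
```

The normal approximation is only reasonable for fairly large batches. For small batches an exact binomial test would be a safer choice.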
How should we test machine learning model performance?
The first thing we need to monitor is the metrics that indicate how well our model is performing. Common metrics can be accuracy, F1 score, precision, recall, AUC, etc. If we detect deterioration over time, or if the results are significantly different from the development set, this needs to be addressed.
Note that in order to detect the deterioration, we need to evaluate our model continuously, preferably on real-world data. In some cases we can automatically retrieve labeled data after the fact. For example, if we try to predict the value of a stock tomorrow, we can evaluate our performance the following day. In many cases, however, it can prove difficult to retrieve labels for the production data.
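As a minimal sketch of continuous evaluation with delayed labels, the code below groups predictions by day, skips records whose label has not arrived yet, and flags days whose accuracy falls well below the development baseline. The record layout `(day, prediction, label)`, the function names, and the tolerance of 0.05 are all assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_day(records):
    """Group labeled records by day and return {day: accuracy}.

    Each record is a (day, prediction, label) tuple; records whose label
    is None (not yet available) are skipped.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for day, prediction, label in records:
        if label is None:
            continue
        total[day] += 1
        correct[day] += int(prediction == label)
    return {day: correct[day] / total[day] for day in sorted(total)}


def deteriorated_days(daily_accuracy, baseline_accuracy, tolerance=0.05):
    """Return the days whose accuracy fell more than `tolerance` below baseline."""
    return [
        day
        for day, acc in daily_accuracy.items()
        if acc < baseline_accuracy - tolerance
    ]


# Hypothetical records; the label for the last one has not arrived yet.
records = [
    ("2024-01-01", 1, 1),
    ("2024-01-01", 0, 0),
    ("2024-01-02", 1, 0),
    ("2024-01-02", 1, 1),
    ("2024-01-03", 0, None),
]
daily = accuracy_by_day(records)
flagged = deteriorated_days(daily, baseline_accuracy=0.90)
```

The same structure works for F1, precision, recall, or any other metric computed per window. Accuracy is used here only to keep the sketch short.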
What might cause your ML model to fail in production?
The most common issues that come up for ML models in production relate to some mismatch or change in the data. We will discuss some of the main concepts you should be familiar with.
Data drift – As time progresses, the data distribution gradually drifts from where we started. This can be caused by a change in the data structure (e.g. adding a new option for gender in a form) or by a change in the real world (e.g. inflation can affect the pricing of properties, life expectancy goes up with time, the user base becomes less tech-oriented).
We can detect this by comparing the distribution of each feature in the production data with its original distribution in the dev set.
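One way to compare a feature's distribution between the dev set and production is the Population Stability Index (PSI). This is only one of several possible drift measures, chosen here because it needs nothing beyond the standard library. The sketch below assumes a single numeric feature, equal-width bins taken from the dev sample, and a small `eps` to guard against empty bins. The function name and defaults are illustrative.

```python
import math


def population_stability_index(reference, production, n_bins=10, eps=1e-6):
    """Compute PSI between two samples of one numeric feature.

    Bins are equal-width over the range of the reference sample; production
    values outside that range are clipped into the first or last bin.
    """
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    prod_frac = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_frac, prod_frac))


# Hypothetical feature values: production is shifted upward by 50.
dev_values = [float(x) for x in range(100)]
prod_values = [x + 50.0 for x in dev_values]
psi = population_stability_index(dev_values, prod_values)
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but any threshold should be tuned to the feature and use case. In practice this check would run once per feature.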
Concept drift – This is a similar idea to data drift, but here the focus is on a change in the target we are trying to predict. For a classification task, for example, we can think of it as though the definitions of the different classes change with time.
Development-Production data discrepancy – Say we are trying to predict the price of a property given all sorts of features, including the house measurements. Now imagine that these measurements are in different units in development data and production data. Or even worse, what if the height and width are swapped in the database after the latest deployment?
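A unit mismatch like the one described above often shows up as production values far outside the range seen in development. The sketch below is one simple way to flag that, assuming rows are dictionaries of numeric features. The function names, the feature name `area_m2`, and the `slack` factor are hypothetical.

```python
def feature_ranges(rows):
    """Return {feature: (min, max)} over a list of dict rows."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_range_features(dev_rows, prod_rows, slack=0.5):
    """Return features whose production values fall outside the dev range.

    The dev range is widened by `slack` times its width on both sides, so
    small excursions are tolerated while unit mix-ups still stand out.
    """
    dev = feature_ranges(dev_rows)
    prod = feature_ranges(prod_rows)
    flagged = []
    for name, (dev_lo, dev_hi) in dev.items():
        if name not in prod:
            flagged.append(name)  # feature missing in production
            continue
        margin = slack * (dev_hi - dev_lo)
        prod_lo, prod_hi = prod[name]
        if prod_lo < dev_lo - margin or prod_hi > dev_hi + margin:
            flagged.append(name)
    return flagged


# Hypothetical data: area recorded in square meters in dev, square feet in prod.
dev_rows = [{"area_m2": 50.0, "rooms": 1}, {"area_m2": 200.0, "rooms": 5}]
prod_rows = [{"area_m2": 540.0, "rooms": 2}, {"area_m2": 2150.0, "rooms": 4}]
suspicious = out_of_range_features(dev_rows, prod_rows)
```

A range check like this would not necessarily catch swapped columns whose ranges are similar, such as height and width. Detecting that kind of error usually requires comparing full distributions or relationships between features.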
As we have seen, monitoring ML models in production is an important task that can be quite tricky. To detect issues before they cause significant damage, we cannot simply treat our ML models as black boxes that work. A model we believed to be perfect may stop working the way we think it should.
If you have any ML models in production at the moment, ask yourself the following questions:
- What are the metrics that I am constantly monitoring? Are they a good indicator of success?
- How will I know when there is some mismatch between the development data and the production data?
- How long will it take me to find out my model is not working as expected?