
Principles in Monitoring Your ML Systems


Machine Learning (ML) models are becoming increasingly popular in many industries, from fraud detection in the financial sector to chatbots for customer service to online advertising and so many other applications. These models operate at the center of large companies’ products, and it is critical to detect when something goes wrong as soon as possible.

This is no simple task when we consider the complexity of ML models. If we try to monitor them like any other piece of software that runs in production, we are likely to miss many subtle issues and only detect them after they have caused a tremendous loss to our company.

Keep Your Metrics Closer

Imagine you are suffering from a mild pain in your leg. It becomes extremely painful for a moment, and then it passes. You keep telling yourself it’s not so bad, and therefore never get it fixed. If it were just a little more painful, you’d get it checked out right away and would probably suffer much less.

A good monitoring system should make you feel that pain even when you doubt whether it is there. With ML models, however, it is harder to define and detect when something has gone “wrong.”

Machine Learning Model Monitoring vs. Classic Software Systems Monitoring

  • Less noticeable. It is likely that your model will output a valid value that is incorrect. No exception will be thrown, your website’s home page won’t be down, you will just get bad results.
  • Harder to test. Running a simple test on a single sample tells us little, since individual predictions are probabilistic and an occasional error is expected. What we want to find out is whether our accuracy is lower than expected.
  • Development-production discrepancy. Development data often does not simulate real-world data precisely enough. Since real-world data is usually not labeled, errors are not easy to detect.

How We Should Test Machine Learning Model Performance

The first thing we need to monitor is the metrics that indicate how well our model is performing. Common metrics include accuracy, F1 score, precision, recall, and AUC. If these deteriorate over time, or differ significantly from the results on the development set, this needs to be addressed.
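As a minimal sketch of this comparison, the snippet below computes accuracy, precision, recall, and F1 for a binary classifier and lists the metrics that fall more than a tolerance below their development-set values. The function names, the 0.05 tolerance, and the sample numbers are illustrative assumptions, not part of any particular library.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def degraded_metrics(dev_metrics, prod_metrics, tolerance=0.05):
    """Names of metrics that dropped by more than `tolerance` versus dev."""
    return sorted(
        name
        for name, dev_value in dev_metrics.items()
        if dev_value - prod_metrics.get(name, 0.0) > tolerance
    )


# Hypothetical numbers: dev-set metrics versus metrics on labeled production data.
dev = {"accuracy": 0.90, "f1": 0.80}
prod = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(degraded_metrics(dev, prod))
```

A fixed tolerance is the simplest possible rule. In practice the threshold would likely depend on how noisy each metric is for the sample sizes available.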

Note that in order to detect the deterioration, we need to perform continuous evaluation on our model, preferably on the real-world data. In some cases, we can automatically retrieve labeled data after the fact (e.g., when predicting the value of a stock tomorrow, we can evaluate the prediction the following day). In many cases, however, it can prove difficult to retrieve labels for the production data.
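When labels do arrive after the fact, continuous evaluation can be as simple as joining logged predictions with the late labels and tracking a metric per day. The sketch below assumes a hypothetical log format of (record id, day, predicted label) tuples and a dictionary of labels keyed by record id. Records whose labels have not arrived yet are skipped.

```python
from collections import defaultdict


def accuracy_by_day(predictions, labels):
    """Accuracy per day over the predictions that already have a true label.

    predictions: iterable of (record_id, day, predicted_label)
    labels: dict mapping record_id -> true label (may lag behind predictions)
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for record_id, day, predicted in predictions:
        if record_id not in labels:
            continue  # label has not arrived yet
        counts[day] += 1
        hits[day] += int(predicted == labels[record_id])
    return {day: hits[day] / counts[day] for day in sorted(counts)}


# Hypothetical prediction log; record "d" has no label yet.
log = [
    ("a", "2024-01-01", 1),
    ("b", "2024-01-01", 0),
    ("c", "2024-01-02", 1),
    ("d", "2024-01-02", 1),
]
late_labels = {"a": 1, "b": 1, "c": 1}
print(accuracy_by_day(log, late_labels))
```

Days with only a few labeled records give noisy values, so it is worth tracking the count alongside the metric.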

Testing. CI/CD. Monitoring.

Because ML systems are more fragile than you think. All based on our open-source core.


What might cause your ML model to fail in production?

The most common issues that come up for ML models in production relate to mismatch or change in the data. We will discuss some of the main concepts you should be familiar with.

Data drift. As time progresses, the data distribution gradually drifts from where we started. This can be caused by some change in the data structure (e.g., adding a new option for gender in a form), or by some change in the real world (e.g., inflation can affect pricing of properties, life expectancy goes up with time, user base becomes less tech-oriented).

We can detect this by comparing the distribution of each feature in the production data with its original distribution in the dev set.
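One common way to score such a comparison for a single numeric feature is the Population Stability Index (PSI), which bins the dev values and measures how far the production bin shares have moved. The sketch below is one possible implementation, not the only drift measure. The equal-width binning, the bin count, and the small floor for empty bins are assumptions made for illustration.

```python
import math


def population_stability_index(dev_values, prod_values, bins=10):
    """PSI between dev and production samples of one numeric feature."""
    lo, hi = min(dev_values), max(dev_values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the dev range fall into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    dev_shares = bin_shares(dev_values)
    prod_shares = bin_shares(prod_values)
    return sum(
        (p - d) * math.log(p / d) for d, p in zip(dev_shares, prod_shares)
    )


# Hypothetical feature values: production is shifted upward by 50.
dev_feature = list(range(100))
prod_feature = [v + 50 for v in dev_feature]
print(population_stability_index(dev_feature, dev_feature))
print(population_stability_index(dev_feature, prod_feature))
```

A frequently quoted rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, but suitable thresholds depend on the feature and the sample size.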

Concept drift. This is similar to data drift, but here the focus is on change in the target we are trying to predict, or more precisely in the relationship between the features and that target. For a classification task, for example, we can think of it as though the definitions of the different classes change over time.
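When labeled production data is available, one rough way to look for concept drift is to check whether similar inputs now receive different labels. The sketch below groups rows into buckets of a single feature and compares the positive-label rate per bucket between dev and production. The bucketing function, the 0.2 threshold, and the data are hypothetical, and this is only one simple heuristic among many.

```python
from collections import defaultdict


def positive_rate_by_bucket(rows, bucket_of):
    """rows: iterable of (feature_value, label) pairs with 0/1 labels."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for value, label in rows:
        bucket = bucket_of(value)
        totals[bucket] += 1
        positives[bucket] += label
    return {bucket: positives[bucket] / totals[bucket] for bucket in totals}


def concept_shift(dev_rows, prod_rows, bucket_of, threshold=0.2):
    """Buckets whose positive rate moved by more than `threshold`.

    Returns bucket -> (dev_rate, prod_rate) for buckets seen in both sets.
    """
    dev_rates = positive_rate_by_bucket(dev_rows, bucket_of)
    prod_rates = positive_rate_by_bucket(prod_rows, bucket_of)
    shared = dev_rates.keys() & prod_rates.keys()
    return {
        bucket: (dev_rates[bucket], prod_rates[bucket])
        for bucket in sorted(shared)
        if abs(dev_rates[bucket] - prod_rates[bucket]) > threshold
    }


def amount_bucket(amount):
    # Hypothetical bucketing of a transaction amount.
    return "low" if amount < 100 else "high"


# Same feature values in both sets, but low amounts are now labeled positive.
dev_rows = [(50, 0), (60, 0), (150, 1), (160, 1)]
prod_rows = [(50, 1), (60, 1), (150, 1), (160, 1)]
print(concept_shift(dev_rows, prod_rows, amount_bucket))
```

In this toy data the feature distribution is identical in both sets, so a feature-level drift check would stay silent while the per-bucket label rates reveal the change.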

Development-production data discrepancy. Say we are trying to predict the price of a property given all sorts of features, including the house measurements. Now, imagine that these measurements are in different units in development data and production data. Or, even worse, imagine that the height and width are swapped in the database after the latest deployment.
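A simple guard against this kind of mismatch is to record the value range of each feature in the development data and measure how often production values fall outside it. The sketch below assumes rows are dictionaries of numeric features. The feature names, the 10% slack, and the sample values are hypothetical.

```python
def learn_ranges(dev_rows):
    """Return feature -> (min, max) observed in the development rows."""
    ranges = {}
    for row in dev_rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_range_share(prod_rows, ranges, slack=0.1):
    """Share of production values per feature outside the dev range.

    The dev range is widened by `slack` times its width on each side.
    """
    shares = {}
    for name, (lo, hi) in ranges.items():
        margin = (hi - lo) * slack
        values = [row[name] for row in prod_rows if name in row]
        if not values:
            continue  # feature missing from production entirely
        outside = sum(1 for v in values if v < lo - margin or v > hi + margin)
        shares[name] = outside / len(values)
    return shares


# Hypothetical example: area recorded in square meters in dev,
# but in square feet in production.
dev_rows = [{"area": 50, "rooms": 2}, {"area": 120, "rooms": 5}]
prod_rows = [{"area": 540, "rooms": 2}, {"area": 1290, "rooms": 4}]
print(out_of_range_share(prod_rows, learn_ranges(dev_rows)))
```

A range check catches unit changes that move values far outside what was seen in development. Two swapped columns with similar ranges would pass it, so a distribution comparison per feature is a useful complement.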


Monitoring ML models in production is a very important but very tricky task. To detect issues before they cause significant damage, we cannot simply treat our ML models as a black box that works. We have seen that the model we believed to be perfect may stop working the way we think it should.

If you have any ML models in production at the moment, ask yourself these questions:

  • What are the metrics that I am constantly monitoring? Are they a good indicator of success?
  • How will I know when there is some mismatch between the development data and the production data?
  • How long will it take me to find out my model is not working as expected?
