Deploying ML models to production poses new challenges that can arise in many different components of the system. This article aims to help you navigate the world of ML in production and to provide a basic understanding of the concepts that will keep you in control of your ML models even after they are deployed.
ML model as part of a larger picture of the system. Source: “Hidden Technical Debt in Machine Learning Systems” (Sculley et al.)
1. Observability and monitoring
To be in control of your ML models in production, it is essential to receive live information about their performance. This is usually done by setting up a dashboard that shows model evaluation metrics alongside operational metrics such as uptime, resource utilization, and latency, together with notifications fired by customized triggers.
The idea of observability is that each part of the system can be observed in action, from the input data, through the engineered features, to the model's predictions and performance. When it comes to observability, the rule is “the more the merrier”.
Observability is the key to being in control (Source)
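One way to picture a "customized trigger" is a rolling average over a monitored metric that raises an alert once it crosses a threshold. The sketch below is only an illustration: the `RollingAlert` class, the window size, and the 200 ms latency limit are all assumptions, not part of any particular monitoring tool.

```python
from collections import deque
from statistics import mean


class RollingAlert:
    """Tracks a rolling window of a metric and flags threshold breaches."""

    def __init__(self, name, window, max_allowed):
        self.name = name
        self.values = deque(maxlen=window)  # keeps only the latest `window` values
        self.max_allowed = max_allowed

    def record(self, value):
        """Store a new observation; return an alert message or None."""
        self.values.append(value)
        # Wait until the window is full so a single early spike does not fire the alert.
        if len(self.values) < self.values.maxlen:
            return None
        avg = mean(self.values)
        if avg > self.max_allowed:
            return f"ALERT: rolling mean of {self.name} is {avg:.1f}, above {self.max_allowed}"
        return None


# Hypothetical latency measurements in milliseconds.
latency_alert = RollingAlert("latency_ms", window=3, max_allowed=200)
for latency in [120, 150, 180, 260, 310]:
    message = latency_alert.record(latency)
    if message:
        print(message)
```

The same pattern could be applied to model metrics (for example a rolling accuracy that must stay above a floor) by flipping the comparison.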
2. Data integrity issues
Similar terms are training-serving skew and data skew.
In most ML applications, there is a long and complex process that generates and transforms the data that ends up being fed to our model. Some of these stages may not even be under our control, so a new deployment on another company’s website can significantly change the final format of the input to our model. For example, even a minor change such as a field rename, or the introduction of a new value to the gender field, may cause our model to perform poorly, and we may never be notified of the issue.
Data integrity issues have the potential to destroy our model; however, some simple steps can be taken to mitigate the threat.
NaN values in production data can be caused by sudden changes in data schema, or by training-serving mismatch (source)
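One of the simple steps mentioned above could be a validation check that runs on every incoming record before it reaches the model. The following is a minimal sketch under stated assumptions: the schema, the field names, and the allowed values are hypothetical and would normally be derived from the training data.

```python
import math

# Hypothetical schema: expected field names, types, and allowed values.
EXPECTED_SCHEMA = {
    "age": {"type": (int, float), "min": 0, "max": 120},
    "gender": {"type": str, "allowed": {"female", "male", "other"}},
}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable integrity problems (empty if none)."""
    problems = []
    for field, rule in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"null/NaN value in: {field}")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"unexpected type for {field}: {type(value).__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"unseen value for {field}: {value!r}")
        if "min" in rule and not rule["min"] <= value <= rule["max"]:
            problems.append(f"out-of-range value for {field}: {value}")
    # Fields the training data never contained, e.g. after an upstream rename.
    for field in record.keys() - schema.keys():
        problems.append(f"unexpected field: {field}")
    return problems


print(validate_record({"age": 34, "gender": "male"}))
print(validate_record({"age": float("nan"), "sex": "male"}))
```

The second call illustrates the rename scenario: the model would silently receive no `gender` value, while the check reports both the missing and the unexpected field.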
3. Model degradation and staleness
As Ecclesiastes put it: “To every thing there is a season, and a time to every purpose under the heaven.” So too for your ML model.
A Machine Learning model is only as good as the data it’s trained on, and thus as the world changes and the new data shifts from what it once was, your model is likely to become stale and degrade. This process is perfectly normal and it can take different amounts of time depending on the scenario.
ML models tend to become stale over time (source)
These processes can be detected by directly measuring the degradation in performance metrics, by estimating the length of the process based on historical data, or by detecting data drift.
4. Data drift and concept drift
Data drift and concept drift are the most common causes of model degradation.
Data drift: when P(X), the distribution of the features, changes over time. This can happen either because of some shift in the data structure or because of a change in the real world. For example, following a financial crisis the average profile of a person requesting a loan might change.
Data drift can be detected in real time, even when the real labels are not available. Thus data drift can serve as an early warning signal for model degradation, including in scenarios where labels arrive late or not at all.
Concept drift: when P(Y|X), the distribution of the correct labels given the features, changes over time. This too can be caused by a shift in the data structure or by a change in reality, but unlike data drift it directly affects prediction quality, because the relationship the model learned no longer holds. For example, the advertisement click rate for a specific product may change dramatically when competition enters the market. Similar terms are target drift and model drift.
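Because data drift concerns only P(X), it can be checked by comparing a feature's training distribution with its recent production distribution. The sketch below uses the two-sample Kolmogorov-Smirnov statistic as one possible drift measure; the synthetic income data and the 0.1 threshold are assumptions chosen purely for illustration.

```python
import bisect
import random


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    for x in ref_sorted + cur_sorted:
        # Fraction of each sample that is <= x.
        ref_cdf = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect.bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


# Synthetic "income" feature: training data vs. two production samples.
rng = random.Random(0)
train_income = [rng.gauss(50, 10) for _ in range(2000)]
prod_same = [rng.gauss(50, 10) for _ in range(2000)]
prod_shifted = [rng.gauss(60, 10) for _ in range(2000)]

DRIFT_THRESHOLD = 0.1  # assumed cut-off; would need tuning per feature

print("stable sample drifted: ", ks_statistic(train_income, prod_same) > DRIFT_THRESHOLD)
print("shifted sample drifted:", ks_statistic(train_income, prod_shifted) > DRIFT_THRESHOLD)
```

Note that no labels are involved anywhere in this check, which is what makes it usable in real time. Detecting concept drift, by contrast, requires labels or a proxy for them.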
5. Feedback loop
Ever heard that horrible noise when a mic is too close to the speaker? Apparently, that can happen to your ML model as well.
Sometimes your model’s predictions can cause its own failure (source)
Say you created a brand new model to predict stock prices based on Donald Trump’s tweets. You test your model and achieve high accuracy, and so you deploy your model and start buying and selling stocks based on the model’s prediction. It may work at first, but soon enough the model’s prediction might affect the stock market behavior and thus generate a type of data drift or concept drift. (This is also called the Paradox of Predictability)
6. Model retraining
As new labeled data becomes available, and as our production model degrades, it’s time to retrain the model. Typically the model is trained from scratch on the full dataset; however, there are paradigms such as incremental learning and online learning that update an existing model by training on new examples as they become available, instead of training from scratch.
Keeping track of recent model retrain iterations using Deepchecks system (source)
Left: the common practice of retraining from scratch. Right: the incremental/online learning paradigm, in which the same model is updated as new examples become available (source)
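To make the online learning paradigm concrete, here is a minimal sketch of a model that is updated one example at a time rather than retrained from scratch. It assumes a plain linear regression trained with stochastic gradient descent; the class name, learning rate, and toy data are illustrative choices only.

```python
class OnlineLinearModel:
    """Linear regression updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.2):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def partial_fit(self, x, y):
        """Take a single gradient step on squared error for one labeled example."""
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLinearModel(n_features=1)

# Initial stream of examples: the target follows y = 2x.
for _ in range(200):
    for x in [0.0, 0.5, 1.0]:
        model.partial_fit([x], 2 * x)
print(round(model.predict([1.0]), 2))

# The world changes to y = 3x; keep updating the same model as new labels arrive.
for _ in range(200):
    for x in [0.0, 0.5, 1.0]:
        model.partial_fit([x], 3 * x)
print(round(model.predict([1.0]), 2))
```

The trade-off is that an online model adapts quickly but can also forget older patterns, which is one reason retraining from scratch on the full dataset remains common.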
7. Seasonality and Data Fluctuation Patterns
Detecting concept drift or a decrease in model performance is not the end of the game. We must ask ourselves what might have caused the shift, in order to understand whether our model will “go back to normal”, and whether we should create a more robust model that won’t undergo the same degradation process.
One of the most common patterns is seasonality. For example, sales increase at the end of the year, tourist attractions get more customers when the weather is nice, public transportation is used at different times on weekends, etc.
An increase in sales towards the end of the calendar year is a typical example of seasonality (source)
While seasonality is something that should be accounted for in the training data (a date feature will have some effect on the prediction), more sudden data fluctuations cannot be predicted. For example, the Covid-19 outbreak affected many industries and caused significant concept drift for online shopping.
Identifying the pattern of the fluctuation will make us wiser regarding the cause of model degradation, and it will help us understand what our model is missing.
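One common way to expose seasonality to a model through a date feature is a cyclical encoding, which places December next to January instead of eleven steps away. This is a sketch of that general technique rather than a method prescribed here; the function and feature names are made up for the example.

```python
import math
from datetime import date


def seasonal_features(day):
    """Map a date to cyclical features so December and January end up close together."""
    angle = 2 * math.pi * (day.month - 1) / 12
    return {
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
        "is_weekend": day.weekday() >= 5,  # Monday=0, so Saturday=5 and Sunday=6
    }


print(seasonal_features(date(2023, 12, 24)))
print(seasonal_features(date(2024, 1, 6)))
```

Features like these help with recurring patterns only. A one-off shock such as the Covid-19 outbreak has no counterpart in the calendar and still has to be caught by monitoring.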
8. Batch vs. Realtime Processing
When applying an ML model to data in production there are typically two options.
Batch processing – Feeding the model batches of examples of a set size is typically more efficient, since we can exploit parallelism by choosing a suitable batch size. However, this is not a good option if we are supposed to process the request and make a prediction in real time. When the set of possible inputs is limited, or when we know what to expect (e.g. a recommendation for a known user), we can make predictions offline and store them, in which case batch processing can work nicely.
Realtime processing or stream processing – In a typical setting, we receive a request from a user, preprocess the request, feed it to the ML model, and then post-process and return the result in real-time. Latency can have a significant negative impact and thus you may need to compress your model and try to minimize operations per request in order to keep the latency to a minimum.
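The two options can also be combined: predictions for known users are computed offline in batches and stored, while unknown requests fall back to a real-time prediction. The sketch below assumes a stand-in scoring function and a hypothetical in-memory feature store; a real system would use an actual model and a persistent cache.

```python
def model_predict_batch(feature_rows):
    """Stand-in for a real model: scores a whole batch in one call."""
    return [0.1 * sum(row) for row in feature_rows]


# Hypothetical feature store keyed by user id.
USER_FEATURES = {"alice": [1.0, 2.0], "bob": [0.5, 0.5], "carol": [3.0, 1.0]}


def precompute_scores(user_features, batch_size=2):
    """Batch path: score all known users offline and store the results."""
    user_ids = list(user_features)
    cache = {}
    for start in range(0, len(user_ids), batch_size):
        batch_ids = user_ids[start:start + batch_size]
        scores = model_predict_batch([user_features[u] for u in batch_ids])
        cache.update(zip(batch_ids, scores))
    return cache


PRECOMPUTED = precompute_scores(USER_FEATURES)


def handle_request(user_id, features=None):
    """Realtime path: serve a stored score if present, otherwise predict on the spot."""
    if user_id in PRECOMPUTED:
        return PRECOMPUTED[user_id]
    return model_predict_batch([features])[0]  # batch of one, latency-sensitive


print(handle_request("alice"))                      # served from the offline cache
print(handle_request("dave", features=[2.0, 2.0]))  # computed at request time
```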
9. Model compression
Model compression is used to speed up predictions, reduce latency, and shrink the memory footprint of the model where memory or disk size is limited (e.g. on-device ML). This can be done in multiple ways:
Quantization – Perhaps the most straightforward way to compress an ML model: reduce the numerical precision of each parameter. For example, we can use 16-bit floats or even 8-bit integers to represent the model weights without affecting quality drastically. (Some approaches even try to use single bits.)
Pruning – Neural network pruning is the process of iteratively eliminating parameters of the network (perhaps even a whole neuron or a layer) in order to compress the model and improve inference speed with little loss in accuracy. In many cases, the number of weights can be reduced by about 90% without a significant change in accuracy.
(The “Lottery ticket hypothesis” goes further and suggests that a large network contains a much smaller subnetwork that, trained on its own, can reach comparable accuracy.)
Knowledge distillation – using a teacher-student model, we can create an equivalent model that is simpler and smaller (the student) that learns to imitate the larger model (teacher).
Using the teacher-student model can help reduce the model's memory footprint and improve its efficiency (Source)
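Of the three approaches, quantization is the easiest to illustrate in a few lines. The sketch below shows one simple scheme, affine (min-max) quantization of float weights to 8-bit integers, written in plain Python with made-up weights. Real frameworks apply more elaborate variants, so treat this as an illustration of the idea only.

```python
def quantize_uint8(weights):
    """Affine quantization of floats to integers in [0, 255]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    quantized = [round((w - lo) / scale) for w in weights]
    return quantized, scale, lo


def dequantize(quantized, scale, lo):
    """Recover approximate float weights from the 8-bit representation."""
    return [q * scale + lo for q in quantized]


weights = [-1.20, -0.31, 0.0, 0.47, 2.05]  # made-up example weights
q, scale, offset = quantize_uint8(weights)
restored = dequantize(q, scale, offset)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"max reconstruction error: {max_error:.4f} (step size {scale:.4f})")
```

Each weight now needs one byte instead of four (for 32-bit floats), and the reconstruction error is bounded by half the quantization step.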
10. A/B testing
A/B testing is a procedure for comparing two variants of a product to test the response to a specific feature. In this process, we serve a certain percentage of the users with variant A while serving the rest with variant B. As long as users are assigned to the groups at random, any statistically significant difference between the two groups can be attributed to the tested feature.
In a similar fashion, we can use A/B testing for our ML models. This way, we are able to evaluate whether the newer model actually achieves better performance in scenarios where we don’t have full information. For example, a recommendation system such as the one used by Netflix knows whether the user decides to watch a suggested movie, but cannot know whether that user would have watched some other movie had it been suggested instead. In such a case we could run two models simultaneously and compare their success rates.
Additionally, using an A/B testing framework can help you avoid issues when deploying a new model since it provides a more gradual transition. Thus, we can start off by directing some small percentage of the load to the new model and evaluate performance, then slowly increase this percentage for a smooth transition.
A/B testing for comparing competing ML models (Source)
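The two ingredients described above, routing a share of the traffic to the new model and comparing success rates, can be sketched as follows. The hash-based assignment, the 10% share, and the counts are assumptions for illustration, and the two-proportion z-test is one standard choice of significance test among several.

```python
import hashlib
import math


def assign_variant(user_id, share_b=0.1):
    """Deterministically route a fixed share of users to model B."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform value in [0, 1)
    return "B" if bucket < share_b else "A"


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference in success rates."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value


print(assign_variant("user-42"))

# Made-up counts: recommendations watched out of recommendations shown.
z, p = two_proportion_z_test(successes_a=1000, total_a=10000, successes_b=130, total_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Hashing the user id keeps each user on the same model across requests, and raising `share_b` step by step gives the gradual transition described above.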
We have seen some basic concepts regarding ML systems in production, which will hopefully enable you to start navigating this sea and take control of your models before it’s too late.