ML models often become stale over time: a model that was a good match for the task when it was trained may gradually stop being relevant. It is therefore essential to understand when, and how often, to retrain an ML model in order to maintain performance without incurring excessive overhead. This is a central concern when running machine learning in production.
What Causes Model Staleness?
Perhaps the primary cause of model staleness and performance degradation is concept drift (also known as model drift). One of the core assumptions in ML is that the training data and the data seen at inference time come from the same distribution. However, changes in the world, such as shifting markets, evolving user behavior, and seasonal effects, can significantly alter the phenomenon we are trying to model, rendering the existing ML model irrelevant. Other common causes include changes in the format of the data, such as features that become newly available or features that stop being available.
Triggers for Retraining
Which events should trigger retraining of your model? For one, if a significant amount of additional data becomes available, retraining is likely to boost your model’s performance, especially if the original training dataset wasn’t very large.
Another thing to watch for is schema changes in the data, such as a renamed column or a newly introduced feature. These changes can have a dramatic effect on the model’s performance. In some cases a simple processing step that translates between formats is enough; in others, a full retraining procedure is required.
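A schema check like this can be run on every incoming batch before it reaches the model. The helper below is an illustrative sketch, not a specific library API; the column names are made up for the example.

```python
def detect_schema_changes(training_columns, incoming_columns):
    """Flag features that disappeared or appeared relative to the
    schema the model was trained on. (Illustrative helper, not a
    specific library API.)
    """
    training, incoming = set(training_columns), set(incoming_columns)
    return {
        "missing": sorted(training - incoming),  # may break the model outright
        "added": sorted(incoming - training),    # candidate signal for retraining
    }

# Example: "income" was renamed to "annual_income" upstream.
changes = detect_schema_changes(
    ["age", "income"], ["age", "annual_income"]
)
print(changes)  # {'missing': ['income'], 'added': ['annual_income']}
```

A rename shows up as one missing and one added column, which a translation step can fix; a genuinely new feature usually calls for retraining.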
Furthermore, if your model’s performance is degrading, it is probably time to retrain. This process can even be automated using triggers that fire when performance drops below a threshold, or using anomaly detection techniques. However, it is not always easy to assess a model’s performance in real time, since ground-truth labels may only arrive after a delay, so you may need to estimate the right time to retrain. One way to do this is backtesting: divide the existing training data into “past” and “future” examples and measure the drift between them. Additionally, significant data drift often reflects concept drift, and data drift can be measured even when labels aren’t available.
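One common label-free drift measure is the Population Stability Index (PSI), which compares a feature’s binned distribution in a reference window against a live window. The sketch below uses synthetic data and the conventional PSI thresholds (which are rules of thumb, not laws):

```python
import math
import random

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference sample and a live
    sample of one feature. Rough rule of thumb: < 0.1 little shift,
    0.1-0.25 moderate, > 0.25 a strong drift signal worth investigating.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def binned_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = min(max(int((x - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Laplace-smooth so empty bins don't cause division by zero
        total = len(sample) + n_bins
        return [(c + 1) / total for c in counts]

    e, a = binned_fractions(expected), binned_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(5000)]
shifted = [random.gauss(1.0, 1) for _ in range(5000)]
print(psi(reference, reference) < 0.1)   # same distribution: low PSI
print(psi(reference, shifted) > 0.25)    # shifted mean: drift flagged
```

Computing PSI per feature on a schedule, and alerting when it crosses the high threshold, is a simple automated retraining trigger that needs no labels at all.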
Another point to consider, is that dramatic trends and events in the actual world may indicate that your model will stop performing as expected. For example, the spread of the coronavirus pandemic hit industries around the world, affecting many ML models in production. In April of 2020, the top 10 search terms on Amazon.com included: toilet paper, face mask, and hand sanitizer – items which were significantly less popular before the pandemic.
Finally, you should ask yourself whether your model might be subject to a feedback loop, where the model’s predictions themselves cause performance degradation over time, or whether it may be subject to adversarial attacks, where users actively try to elicit a specific outcome from your model. If either of these issues applies in your case, you are likely to need to retrain your models more frequently.
Following is a useful chart by Henrik Skogström that summarizes the cases in which you may need to retrain your ML model:
Decision tree for whether it’s time to retrain your model (source)
How to Retrain Your Model
Retraining an ML model in the restricted sense means keeping the same architecture and hyperparameters and training the model on the currently available data. That said, additional tweaks and changes to the model can be introduced along the way. Essentially, if the reason for training a new model is a decrease in performance due to model staleness, the process should be framed as model retraining.
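In scikit-learn, retraining in this restricted sense maps naturally onto `sklearn.base.clone`, which copies an estimator’s hyperparameters but not its learned weights. The model choice and datasets below are illustrative:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical "old" model with fixed hyperparameters.
old_model = LogisticRegression(C=0.5, max_iter=200)
X_old, y_old = make_classification(n_samples=500, random_state=0)
old_model.fit(X_old, y_old)

# Retraining in the restricted sense: same architecture and
# hyperparameters, refit on the current data window.
X_new, y_new = make_classification(n_samples=500, random_state=1)
new_model = clone(old_model)  # copies hyperparameters, not weights
new_model.fit(X_new, y_new)
```

Keeping the hyperparameters fixed isolates the effect of the fresh data, which makes before/after comparisons of the two models meaningful.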
Which Data Should Be Used?
One question that comes up naturally when you retrain ML models is whether you should remove old data entries as you incorporate new ones, or simply extend the dataset to include the new samples. After all, if concept drift means the old data distribution differs significantly from the current one, perhaps we should get rid of the old data! There is no one-size-fits-all answer, and you may need to experiment on historical data with backtesting to determine the extent of the drift and find out which option best matches your setting.
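Such a backtest can be sketched in a few lines. Here the “model” is just the training-window mean, a stand-in for any estimator; the point is the evaluation protocol (a chronological split, then comparing training windows), not the model. The data is synthetic, with a level shift halfway through to simulate concept drift:

```python
import random

random.seed(42)
# Synthetic time series whose level shifts halfway through (concept drift).
series = [random.gauss(0, 1) for _ in range(500)] + \
         [random.gauss(3, 1) for _ in range(500)]

past, future = series[:900], series[900:]  # chronological split, no shuffling

def mae_of_mean_model(train, test):
    """Predict the training-window mean; score with mean absolute error."""
    pred = sum(train) / len(train)
    return sum(abs(y - pred) for y in test) / len(test)

err_all_history = mae_of_mean_model(past, future)       # keep everything
err_recent_only = mae_of_mean_model(past[-200:], future)  # drop old data
print(err_recent_only < err_all_history)  # recent window wins under drift
```

If the drift in your historical data were mild instead, the comparison would typically flip in favor of keeping the full history, since more data means lower variance.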
Another subject worth addressing is proper methodology for risk-free deployment. Simply disabling the old model and replacing it with the new one poses a significant risk: if the new model does not run properly, there is no operating model available in the meantime. Several practices minimize this risk. One is shadow deployment: the new model is deployed to production but runs silently in parallel with the old model, so you can verify that it performs as expected before it serves any traffic. Another is canary deployment, which resembles A/B testing: the new model receives a small portion of the incoming requests, allowing a comparative analysis between the two models and a smooth transition to the new one.
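The routing logic behind a canary deployment can be reduced to a few lines. This is a minimal sketch with made-up names, using plain callables in place of real inference services:

```python
import random

def canary_route(request, old_model, new_model, canary_fraction=0.05, rng=random):
    """Send a small, configurable share of traffic to the candidate model.

    Returns the prediction plus a flag recording which model served the
    request, so the two can be compared offline before a full rollout.
    (Illustrative sketch, not a specific serving framework's API.)
    """
    model = new_model if rng.random() < canary_fraction else old_model
    return model(request), model is new_model

# Toy models: callables standing in for real inference endpoints.
old = lambda x: x * 2
new = lambda x: x * 2 + 0.01

random.seed(1)
served_by_new = sum(canary_route(1.0, old, new)[1] for _ in range(10_000))
print(0.03 < served_by_new / 10_000 < 0.07)  # roughly 5% of traffic
```

In a real system the routing decision is usually made at the load balancer or feature-flag layer rather than in application code, but the principle is the same: a small, observable slice of traffic, easy to dial up or roll back.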
In some scenarios a continual learning (online learning) framework is a better fit than static models, making explicit retraining unnecessary. In this paradigm, ML models are inherently objects that change over time: as new examples become available, the model trains on them in production. Because training in production carries real risks, this approach is not widely used in industry and is studied more as a theoretical framework.
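Some libraries do support this incremental style directly; in scikit-learn, estimators such as `SGDClassifier` expose `partial_fit`, which updates the model on each mini-batch instead of refitting from scratch. The data and batch size below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# A stream of mini-batches standing in for data arriving in production.
X, y = make_classification(n_samples=3000, random_state=0)
model = SGDClassifier(random_state=0)

classes = [0, 1]  # all classes must be declared on the first partial_fit call
for start in range(0, len(X), 300):
    batch_X, batch_y = X[start:start + 300], y[start:start + 300]
    model.partial_fit(batch_X, batch_y, classes=classes)

# The model keeps updating as batches arrive; there is no separate
# batch retraining step.
```

The trade-off the paragraph above alludes to is visible here: a single bad batch immediately changes the production model, which is exactly why most teams prefer offline retraining plus a controlled deployment.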
Retraining machine learning models is becoming a standard practice, and there is no reason it should be a complex operation, given the available tools for automating the process, such as Deepchecks. Always remember – keep your models fresh and your data clean!