Deploying machine learning models to production is a complex task that draws on several kinds of expertise, and a lot can go wrong when a model developed in an experimental environment is integrated into a real-world system.
To do this well, you need to consider the required resources, the system architecture, the ability to detect issues and monitor performance, and the ability to reproduce predictions for debugging. In this post we focus on some practices and concepts that will help you navigate this process.
The process is broadly similar to deploying any other software system, but some considerations are specific to ML systems.
Batch vs. Stream Processing
Is your model required to generate predictions in real time (stream), or can it operate offline (batch)? ML computations are generally more efficient on batches, because the work can be parallelized. For real-time systems, you will probably want to run the model on multiple endpoints and use a load-balancing mechanism to handle incoming requests.
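One way to keep some of the batch efficiency in a near-real-time setting is micro-batching: group incoming requests into small batches and call the model once per batch. The sketch below is illustrative only. `predict_batch` is a hypothetical stand-in for a real model, and `micro_batches` and `serve` are names made up for this example.

```python
from typing import Iterable, Iterator, List, Sequence


def micro_batches(requests: Iterable[float], max_batch_size: int) -> Iterator[List[float]]:
    """Group incoming requests into lists of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[float] = []
    for request in requests:
        batch.append(request)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def predict_batch(features: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a model: one call scores a whole batch."""
    return [2.0 * x + 1.0 for x in features]


def serve(requests: Iterable[float], max_batch_size: int = 4) -> List[float]:
    """Score all requests, calling the model once per micro-batch."""
    predictions: List[float] = []
    for batch in micro_batches(requests, max_batch_size):
        predictions.extend(predict_batch(batch))
    return predictions
```

A real service would usually also flush a partial batch after a short timeout, so that a request never waits indefinitely for the batch to fill.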
For ML models, utilizing GPU capabilities helps reduce runtime and latency significantly. However, using the most up-to-date hardware can be quite costly. Thus understanding the requirements of your system can help you achieve the right balance between good performance and minimal resource cost.
Compressing your model before deployment can save a significant amount of resources. This is especially relevant for on-device ML models. Common techniques include model quantization, pruning (for example, searching for a “lottery ticket” subnetwork of a neural network), and knowledge distillation.
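To make quantization concrete, here is a minimal sketch of symmetric 8-bit quantization applied to a plain list of weights. It only illustrates the idea of trading precision for size. Production frameworks ship their own quantization tooling, and the function names below are made up for this example.

```python
from typing import List, Tuple


def quantize_symmetric_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] using one shared scale.

    Assumes a non-empty list of weights.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]
quantized, scale = quantize_symmetric_int8(weights)
restored = dequantize(quantized, scale)
# Each restored weight differs from the original by at most about scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored in one byte instead of four (for 32-bit floats), at the cost of a small, bounded rounding error.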
Access and Observability
To keep ML models performing well in production, the deployment should provide easy access to evaluation metrics, reproducible predictions, and straightforward debugging.
An in-depth understanding of your model’s performance in production requires comprehensive logging. You may want to log a high percentage of requests and predictions, along with telemetry such as resource utilization and the number of requests. A verbose log makes it possible to locate anomalies and trace the causes of potential issues.
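As an illustration, each prediction can be logged as one structured JSON line so that it can be searched and aggregated later. This is a sketch using only the Python standard library. The field names (`request_id`, `model_version`, `latency_ms`) are assumptions chosen for the example, not a required schema.

```python
import json
import logging
import time
from typing import Any, Dict


def build_prediction_record(request_id: str, model_version: str,
                            features: Dict[str, Any], prediction: Any,
                            latency_ms: float) -> Dict[str, Any]:
    """Collect everything needed to inspect one prediction later."""
    return {
        "timestamp": time.time(),
        "request_id": request_id,
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }


def log_prediction(logger: logging.Logger, **fields: Any) -> str:
    """Emit the record as a single JSON line and return that line."""
    line = json.dumps(build_prediction_record(**fields), sort_keys=True)
    logger.info(line)
    return line


logger = logging.getLogger("predictions")  # hypothetical logger name
line = log_prediction(
    logger,
    request_id="req-001",
    model_version="example-v1",
    features={"age": 42, "country": "NL"},
    prediction=0.87,
    latency_ms=12.5,
)
```

Logging the model version alongside the inputs is what later makes it possible to tie an anomaly to a specific deployment.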
To fully understand your model’s performance, it is important to be able to reproduce predictions made in production. Regulation such as the proposed EU regulations is expected to require this, so that data science teams can account for mistakes made by ML systems and correct them where possible.
To meet this requirement, you will need proper version management for your model and datasets (check out DVC and MLflow), as well as for the surrounding code, so that any given prediction can be debugged and reproduced.
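One way to picture this requirement is to store, next to every prediction, a fingerprint of everything that produced it. The sketch below does not use the DVC or MLflow APIs. It assumes you already have version identifiers for the model, dataset, and code (the strings shown are hypothetical) and simply hashes them together with the input.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(payload: Dict[str, Any]) -> str:
    """Hash a dictionary in a way that does not depend on key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prediction_manifest(model_version: str, dataset_version: str,
                        code_version: str, features: Dict[str, Any],
                        prediction: Any) -> Dict[str, Any]:
    """Record the full context of one prediction plus a hash of that context."""
    context = {
        "model_version": model_version,
        "dataset_version": dataset_version,
        "code_version": code_version,
        "features": features,
    }
    return {**context, "prediction": prediction, "context_hash": fingerprint(context)}


manifest = prediction_manifest(
    model_version="model-2024-01",   # hypothetical identifiers
    dataset_version="data-v7",
    code_version="abc1234",
    features={"age": 42, "country": "NL"},
    prediction=0.87,
)
```

When debugging, rebuilding the same context and comparing hashes confirms that you are replaying the prediction with the same model, data, and code that served it.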
Once a model is in production, there are many reasons it may not perform as expected. Monitoring the model’s performance, together with the input data and its distribution over time, is essential for detecting issues early and for alerting you to problems with the data pipeline or to a model that may need retraining.
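As one example of monitoring an input distribution, the sketch below computes the Population Stability Index (PSI) between a reference sample and recent production values of a single numeric feature. PSI is just one of several possible drift measures. The bin edges and the 0.2 alert threshold are assumptions for illustration (0.2 is a commonly quoted rule of thumb, not a universal constant).

```python
import bisect
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values in each bin defined by sorted edges (floored at eps)."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1  # number of edges <= v
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [0.1 * i for i in range(100)]    # e.g. a feature at training time
production = [v + 5.0 for v in reference]    # the same feature, shifted
edges = [2.5, 5.0, 7.5]                      # assumed bin edges

psi_same = population_stability_index(reference, reference, edges)
psi_shifted = population_stability_index(reference, production, edges)
drift_detected = psi_shifted > 0.2           # assumed alert threshold
```

In practice such a check would run on a schedule for each important feature and for the model’s output, and raise an alert when the threshold is crossed.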
Automating the Process
Only 14% of companies with ML systems can deploy a model in under a week.
In Algorithmia’s report titled “State of Enterprise ML”, the authors show that the deployment process takes an extremely long time for most companies. Ideally, data scientists would focus their energy on researching new problems and creating new ML models that provide value to the company. All too often, however, they end up spending much of their time deploying models that already “work”. Automating the process and using third-party tools can significantly reduce this “hidden technical debt”. The following are some areas where automation can help significantly.
Model retraining – Many models become stale over time due to data drift and concept drift. When new data becomes available, you will want to retrain the model and deploy the updated version automatically.
Automatic testing – Basic testing before deploying any model should be automated. Testing machine learning models is not entirely straightforward, but with a little effort it is possible to implement comprehensive tests that ensure your model is performing as expected.
Deployment strategy – How do we seamlessly deploy a new model without interrupting the model currently in production? Strategies from traditional software deployment can be applied, such as canary deployment, where at first only a small percentage of traffic is directed to the new model to verify that it behaves correctly. Shadow deployment is even less risky: production data flows to the new model, but its predictions are not used at this stage, and the existing production model keeps serving requests without interruption. Finding the right process and automating it can save a significant amount of time and protect you from potential catastrophes caused by failed deployments.
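The canary and shadow strategies described above can be sketched as a small request router. This is a simplified illustration, not a production implementation. `DeploymentRouter` and its parameters are hypothetical names, and the models are plain Python callables.

```python
import random
from typing import Any, Callable, List, Optional, Tuple

Model = Callable[[Any], Any]


class DeploymentRouter:
    """Route requests between the current model and a candidate model."""

    def __init__(self, current: Model, candidate: Model,
                 canary_fraction: float = 0.0, shadow: bool = False,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= canary_fraction <= 1.0:
            raise ValueError("canary_fraction must be between 0 and 1")
        self.current = current
        self.candidate = candidate
        self.canary_fraction = canary_fraction
        self.shadow = shadow
        self.rng = rng or random.Random()
        # (request, served prediction, candidate prediction or exception)
        self.shadow_log: List[Tuple[Any, Any, Any]] = []

    def predict(self, request: Any) -> Any:
        if self.shadow:
            # Shadow mode: always serve the current model; the candidate
            # runs on the same input but its output is only recorded.
            served = self.current(request)
            try:
                shadow_result = self.candidate(request)
            except Exception as exc:  # a candidate failure must not reach users
                shadow_result = exc
            self.shadow_log.append((request, served, shadow_result))
            return served
        # Canary mode: send a small random share of traffic to the candidate.
        if self.rng.random() < self.canary_fraction:
            return self.candidate(request)
        return self.current(request)
```

A typical rollout would start in shadow mode, compare the logged predictions offline, then switch to a small `canary_fraction` and increase it gradually while monitoring stays healthy.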
To sum up, there are many things to consider when deploying machine learning models. The different aspects call for expertise in different fields, such as data science, DevOps, and engineering, so a proper deployment process requires the members of the team to work together. Furthermore, automation and third-party solutions can help make the process smooth and allow data scientists to focus more of their time on research, where their strengths lie.