Deploying machine learning models at scale is one of the most difficult challenges for any organization looking to monetize the benefits of machine learning. Models have moved from the lab environment, where accuracy was the main and often the only consideration, to production environments where they are served in real time or near real time. There are three basic ways of deploying machine learning models:
On-demand deployments are mostly REST APIs: the client sends a POST request carrying the input, the server passes that input to the machine learning model, and the model's prediction is returned as the response.
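A minimal sketch of such a REST endpoint, using only the Python standard library, might look like the following. The `predict` function is a hypothetical stand-in for a real model, and the `features` field name and port 8080 are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical placeholder model: returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_post(body: bytes):
    """Turn a raw POST body into (status_code, JSON response bytes)."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
        if not features:
            raise ValueError("empty feature list")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed input is the client's problem: answer 400, don't crash.
        return 400, json.dumps({"error": str(exc)}).encode()
    return 200, json.dumps({"prediction": predict(features)}).encode()


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_post(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8080) -> None:
    """Start a blocking development server (the port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), PredictionHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the request logic in `handle_post` means it can be tested without opening a socket.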
For example, Google's Cloud Vision API is a hosted service to which we can send a POST request with an image and get back the prediction results in the response, as and when we need them. The example below shows the response of the Google Vision API to a demo image.
Batch processing is mostly used when we don't know how frequently data will arrive and the results aren't required immediately. In this setup we aggregate the infrequent incoming data and process it in tranches, which lets us use borrowed, temporary hosting for the basic machine learning model infrastructure.
For example, batch processing of a large stash of documents using Apache Beam.
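The idea of processing aggregated data in tranches can be sketched in plain Python as below. This is not Apache Beam code. `predict_batch` is a hypothetical callable that scores a list of records, and the default tranche size of 1000 is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def tranches(records: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_batch_job(records: Iterable, predict_batch: Callable[[List], List],
                  size: int = 1000) -> List:
    """Score all records tranche by tranche and collect the results."""
    results = []
    for batch in tranches(records, size):
        results.extend(predict_batch(batch))
    return results
```

Because `tranches` consumes its input lazily, only one tranche needs to be held in memory at a time.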
Edge deployment refers to a setup in which, instead of passing input data to a back end, the model prediction is computed on edge devices.
It’s a preferred method since it:
- Helps improve latency, since the data is processed locally.
- Reduces cost.
- Adds to the security of the data by processing sensitive data at the edge (for example, processing PII on the edge reduces the attack surface and therefore the risk of critical data leaks).
The deployment is usually done directly at the source of the incoming data, for example within a mobile device for facial recognition or on a Raspberry Pi for micro-controller setups. The data is not transferred to any back end; all of the processing happens at the point of origin. The processing of critical data locally adds to the security aspect by improving the
The Deployment Checklist
We are going to create a simple checklist to formalize the procedure for machine learning deployments, taking the process from model development through to model deployment. We assume that a working version of the model exists and gives satisfactory results on the metric chosen to measure its performance.
1. Determine appropriate deployment structure and pipeline
First, determine which of the above categories the deployment pipeline falls into, namely:
- On-demand deployment
- Edge deployment
- Batch processing
- Hybrid deployment
The choice is primarily determined by how frequently new data arrives and by the “urgency” of the expected results.
2. Optimise pre-processing pipeline
Depending on the type of deployment chosen, we also need to create an appropriate ingestion pipeline that can sustain an optimized flow of data. The same care applies on the output side, where the results flowing out of our model are consumed.
These post-processing pipelines are often the most ignored aspect of a machine learning implementation. We may run into issues such as database ingestion rate limits in batch processing, or a number of parallel requests too high for the server to handle.
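One way to respect a database ingestion rate limit is to write results in throttled chunks, as in the sketch below. `write_chunk` is a hypothetical callable that performs the actual database write, and the chunk size and interval are assumed values to be tuned against the real limit.

```python
import time


def write_throttled(rows, write_chunk, chunk_size=500, min_interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Write rows in chunks, starting at most one chunk per `min_interval` seconds.

    `sleep` and `clock` are injectable so the pacing can be tested without waiting.
    Returns the number of chunks written.
    """
    chunks_written = 0
    last_start = None
    for start in range(0, len(rows), chunk_size):
        if last_start is not None:
            # Wait out whatever remains of the interval since the last write began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        write_chunk(rows[start:start + chunk_size])
        chunks_written += 1
    return chunks_written
```

The same pacing idea applies on the input side, where a queue or a cap on concurrent requests keeps bursts of parallel traffic from overwhelming the server.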
3. Clean the code for model deployment
No matter how reliable and scalable the deployment code is, appropriate exception handling needs to be set up. Deployable machine learning code should cover the following considerations:
- Robust exception handling
- Exception logging, ideally through a generic, reusable method
- Logging of prediction results to capture production drift
- Optimized resource utilization
- Restricted response times, with termination of requests that exceed them
We can never be sure what kind of requests will come in from users, and we don't want expensive hardware sitting idle only because one function in the code has errored out.
Restricting the response time (in line with the maximum time we expect the model to take for the extreme entries in the inference data distribution) goes a long way toward ensuring a smoothly functioning end-to-end (E2E) pipeline.
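The considerations above (exception handling, exception logging, prediction logging, and a response-time limit) could be combined in one wrapper along the following lines. This is a sketch under assumptions: `predict` is any hypothetical model callable, and the logger name, pool size, and two-second default timeout are illustrative choices.

```python
import concurrent.futures
import logging

logger = logging.getLogger("inference")
# A shared pool; max_workers=4 is an arbitrary choice for the sketch.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def safe_predict(predict, payload, timeout_s=2.0):
    """Run predict(payload); always return a dict instead of raising."""
    future = _executor.submit(predict, payload)
    try:
        result = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # A running thread cannot be killed; we stop waiting for it and
        # cancel the task if it has not started yet.
        future.cancel()
        logger.error("prediction timed out after %.2fs", timeout_s)
        return {"ok": False, "error": "timeout"}
    except Exception:
        # logger.exception records the full traceback for later debugging.
        logger.exception("prediction failed")
        return {"ok": False, "error": "internal error"}
    logger.info("prediction=%r", result)  # kept for later drift analysis
    return {"ok": True, "prediction": result}
```

Note that a thread-based timeout only stops the caller from waiting. If a hard stop is required, running inference in a separate process that can be terminated is one alternative.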
4. Containerize your code
This is perhaps the most critical piece in making sure that our deployable code scales. A stateless application is the preferred way to deploy machine learning models, and containerizing it ensures that we don't run into dependency issues if and when an open-source module is discontinued or upgraded to an incompatible version.
Containers help us by:
- Deploying our solution at scale quickly
- Quickly adapting the solution to different cloud and local environments
The figure below shows one such solution architecture deployed on AWS, with containerized inference for a machine learning model.
The most frequently used and easily accessible way to containerize is Docker. It is easy to use, and its long history within the community means we can usually find solutions to issues we encounter at build time or at runtime. The isolation that containers provide also helps us achieve higher security for our application.
5. Create versions
Both the model and the surrounding code in the container need standardized versioning, to keep track not only of progress but also of any breaking changes in subsequent upgrades.
Some of the popular versioning tools are:
- MLMD (Machine learning metadata)
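Independently of any particular tool, the core of a version record can be sketched as below. This does not use the MLMD API. The field names are hypothetical, and the breaking-change check assumes the model follows semantic versioning (`major.minor.patch`), which the article does not prescribe.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_version_record(model_bytes: bytes, model_version: str,
                         code_version: str) -> dict:
    """Tie a model artifact to the code version that serves it."""
    return {
        "model_version": model_version,
        "code_version": code_version,
        # The hash detects a changed artifact shipped under an old version.
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def is_breaking_change(old_version: str, new_version: str) -> bool:
    """Under semantic versioning, a higher major number signals a breaking change."""
    return int(new_version.split(".")[0]) > int(old_version.split(".")[0])


record = build_version_record(b"serialized-model", "1.2.0", "abc123")
print(json.dumps(record, indent=2))
```

Storing such a record next to each container image makes it possible to trace a prediction back to the exact model and code that produced it.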
6. Tracking mechanism
To keep our models relevant and consistent once they are deployed in production, a consistent monitoring mechanism is important. A simple implementation could be a cron job that runs accuracy checks on the deployed model by holding out a fraction of the incoming data and testing/benchmarking accuracy on it.
It is also useful to route requests to a cache for some time and keep track of:
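The check that such a scheduled job runs might look like the sketch below. It assumes that true labels eventually become available for the held-out requests. The 10% hold-out fraction and the 0.05 tolerance are illustrative values, not recommendations.

```python
import random


def hold_out(requests, fraction=0.1, seed=None):
    """Randomly set aside roughly `fraction` of incoming requests for labelling."""
    rng = random.Random(seed)
    return [r for r in requests if rng.random() < fraction]


def accuracy(predictions, labels):
    """Share of predictions that match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty inputs")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def needs_attention(predictions, labels, baseline, tolerance=0.05):
    """True when accuracy falls more than `tolerance` below the baseline."""
    return accuracy(predictions, labels) < baseline - tolerance
```

A cron entry would then call `needs_attention` on the latest labelled hold-out set and raise an alert, or trigger retraining, when it returns `True`.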
- Response time
- Inference data distribution
- Result consistency (across versions)
- Request size
- Concept Drift
- Automated training triggers
- Feature importance shift
Tracking these helps ensure that we can develop, test, and deploy machine learning models in production on the fly while keeping the behavior of our deployment consistent.
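For the inference data distribution item in particular, one common heuristic is the Population Stability Index (PSI), sketched here for a single numeric feature. The article does not prescribe PSI. The number of bins and the small `eps` floor are assumptions, and the threshold of 0.2 often quoted for a significant shift is a rule of thumb rather than a guarantee.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the end bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


training_sample = [i / 100 for i in range(1000)]
shifted_sample = [x + 5 for x in training_sample]
print(psi(training_sample, shifted_sample) > 0.2)
```

Running this per feature on a schedule, and alerting when the index crosses the chosen threshold, is one simple way to turn the tracked distribution into an automated retraining trigger.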
Production deployment is the ultimate test of any data science model and should be done with the user experience in mind, as with any other application we create. We should therefore strive for low latency, consistent responses to the same data, exception handling, drift monitoring, and automated upgrades (retraining triggers). The image below demonstrates how automated triggers for retraining models help us keep the quality of the models consistent over time.
These considerations and our checklist, when properly followed, will help ensure a successful deployment that is capable of delivering customer delight to the maximum possible extent. In the words of Godfrey Reggio, “It’s not that we use technology, we live technology”.