Top Considerations for Deploying Machine Learning Models


The deployment of Machine Learning models at scale is one of the most difficult challenges for any organization looking to monetize the benefits of Machine Learning. Models have moved from the lab environment (where accuracy was the main, and often the only, consideration) to the production environment (where near-real-time or real-time serving is expected). There are three basic implementations of deploying Machine Learning models:

On-demand Deployment

These are mostly REST APIs: the client sends the model input in a POST request, the server runs the Machine Learning model on that input, and responds with the model’s prediction.

For example, Google’s Cloud Vision API is a hosted service to which we can send a POST request containing an image and receive prediction results whenever we need them. In the example below, we see the response of the Google Cloud Vision API:
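On the serving side, an on-demand endpoint can be sketched as follows. This is a minimal standard-library example, not a production server (a real deployment would typically use Flask or FastAPI behind a proper application server); `DummyModel` and its output are placeholders for your own model code.

```python
# Minimal on-demand prediction endpoint using only the standard library.
# `DummyModel` is a placeholder for a real, deserialized model.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class DummyModel:
    def predict(self, features):
        # Placeholder for real inference.
        return {"label": "cat", "score": 0.97}

model = DummyModel()

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))    # input from the POST body
        result = model.predict(payload.get("features"))  # run the model
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)                           # respond with the prediction

# To serve: HTTPServer(("", 8080), PredictHandler).serve_forever()
```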

Batch Deployment

This is usually done when the frequency of the incoming data is unknown and the results aren’t required immediately. This implementation is preferable when we can aggregate the infrequent incoming data and process it in tranches, making use of borrowed, temporary hosting for the basic Machine Learning model infrastructure.

As an example, take the batch processing of large stashes of documents using Apache Beam.
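In production this pattern often runs on a framework such as Apache Beam; the core idea can be sketched in plain Python. Here `score` stands in for a real model call, and the tranche size is illustrative.

```python
# Sketch of batch scoring: infrequent inputs are accumulated and then
# processed in tranches. `score` is a placeholder model call.
from itertools import islice

def score(doc: str) -> int:
    # Placeholder model: "predict" the document length.
    return len(doc)

def batches(items, size):
    """Yield successive tranches of at most `size` items."""
    it = iter(items)
    while True:
        tranche = list(islice(it, size))
        if not tranche:
            return
        yield tranche

def run_batch_job(documents, batch_size=100):
    results = []
    for tranche in batches(documents, batch_size):
        # One model invocation (or one temporary worker) per tranche.
        results.extend(score(d) for d in tranche)
    return results
```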


Edge Deployment

This is a setup where instead of passing input data to a backend, the model prediction is computed on Edge Devices.

This is preferable since it:

  • Helps improve latency, since the data is processed on-device;
  • Reduces cost; and
  • Adds to the security of the data by processing sensitive data at the edge (e.g., processing PII on the edge reduces the attack surface and therefore the risk of critical data leaks).

This deployment is usually done directly at the source of the incoming data: within a mobile device for facial recognition, or on a Raspberry Pi or a micro-controller, for example. The data is never transferred to a backend; instead, the entire processing is done at the point of origination.
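The essence of edge inference can be sketched as follows: both preprocessing and prediction run locally, so nothing leaves the device. `tiny_model` is a stand-in for a real on-device model (e.g., a quantized network), and the threshold logic is purely illustrative.

```python
# Sketch of edge inference: the raw frame is preprocessed and scored
# locally, so no data is sent to a backend.
def preprocess(frame):
    # Normalize pixel values, standing in for a real image pipeline.
    return [px / 255.0 for px in frame]

def tiny_model(features):
    # Placeholder model: threshold on mean brightness.
    mean = sum(features) / len(features)
    return "face" if mean > 0.5 else "no_face"

def infer_on_device(frame):
    return tiny_model(preprocess(frame))  # everything stays on the edge
```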

The Deployment Checklist

We are going to create a simple checklist to ensure that we are formalizing the procedure for machine-learning deployments. We will take the process forward from the data science model development to the data science model deployment. We will assume that there is a working version of the model that is giving us satisfactory accuracy on the designated metric for measuring the performance of the model.

1. Determine Appropriate Deployment Structure and Pipeline

First, determine which of the above categories the deployment pipeline falls into, namely:

  • On-demand Deployment
  • Edge Deployment
  • Batch Processing
  • Hybrid Deployment

The choice is driven primarily by the frequency at which new data arrives and the “urgency” of the expected results.
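That decision can be captured in a small rule of thumb. The function below is a hedged sketch, not a formal taxonomy; hybrid deployments mix these answers, and the inputs are simplified assumptions.

```python
# Illustrative decision helper for picking a deployment type from the
# two drivers named above: data/result urgency and data locality.
def choose_deployment(results_urgent: bool, data_stays_on_device: bool) -> str:
    if data_stays_on_device:
        return "edge"       # sensitive/local data, process at the source
    if results_urgent:
        return "on-demand"  # real-time responses via a hosted API
    return "batch"          # aggregate infrequent data, process in tranches
```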

2. Optimize Pre-processing Pipeline

Depending on the type of deployment chosen, we also need to create an appropriate ingestion pipeline capable of maintaining an optimized flow of data. The pipeline must be up to the mark not only for ingestion, but also for the consumption of the results flowing out of our model.

These post-processing pipelines are often the most ignored aspect of a Machine Learning implementation. We may encounter issues like database ingestion-rate limits for batch processing, or a number of parallel requests so high that it overwhelms the server.
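One common mitigation for the ingestion-rate problem is to flush results downstream in bounded chunks. The sketch below assumes a generic `sink` callable standing in for a real database client; the chunk size and pause are illustrative.

```python
# Sketch of a post-processing guard: write model results downstream in
# chunks so we stay under a sink's ingestion-rate limit.
import time

def write_results(results, sink, max_rows_per_call=500, pause_s=0.0):
    """Flush results in chunks no larger than the sink's ingestion limit."""
    written = 0
    for start in range(0, len(results), max_rows_per_call):
        chunk = results[start:start + max_rows_per_call]
        sink(chunk)          # one bounded insert per call
        written += len(chunk)
        time.sleep(pause_s)  # optional back-off between calls
    return written
```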

3. Clean the Code for Model Deployment

No matter how reliable and scalable the deployment code is, appropriate exception handling needs to be set up. Deployable Machine Learning code needs the following considerations:

  • Robust exception handling;
  • Exception logging;
  • Prediction-results logging to capture production drift;
  • Optimized resource utilization; and
  • Restricted response timings and termination of overlong requests.

We can never be sure what kind of requests will come in from the users. We don’t want expensive hardware to sit idle only because one function in the code has errored out. Restricting the response time (in line with the maximum time we expect the model to take for the extreme entries in the inference data distribution) helps a great deal in ensuring a smoothly functioning E2E pipeline.
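The checklist items above can be combined in a single wrapper: catch and log every exception, log predictions for drift analysis, and enforce a hard response deadline. This is a sketch under simplifying assumptions; `model` is a placeholder, and the thread-based timeout is one portable option among several.

```python
# Robust prediction wrapper: exception handling, exception/prediction
# logging, and a hard response deadline. `model` is a placeholder.
import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("inference")
_executor = ThreadPoolExecutor(max_workers=4)

def model(x):
    return x * 2  # placeholder prediction

def safe_predict(x, timeout_s=1.0):
    future = _executor.submit(model, x)
    try:
        result = future.result(timeout=timeout_s)             # enforce deadline
        log.info("prediction input=%r output=%r", x, result)  # drift log
        return {"ok": True, "result": result}
    except TimeoutError:
        future.cancel()                                       # terminate slow request
        log.error("prediction timed out for input=%r", x)
        return {"ok": False, "error": "timeout"}
    except Exception:
        log.exception("prediction failed for input=%r", x)    # never crash the server
        return {"ok": False, "error": "internal"}
```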

4. Containerize Your Code

This is perhaps the most critical piece in ensuring we have deployable code that scales. A stateless application is always the preferred way to deploy Machine Learning models. Containerizing ensures we never run into dependency issues if and when an open-source module is discontinued or upgraded to an incompatible version.

Containers help us:

  • Deploy our solution at scale quickly; and
  • Quickly adapt the solution to different cloud and local environments.

The figure below shows one such solution architecture deployed on AWS with containerized inference for a Machine Learning model.

The most frequently used and easily accessible way to containerize is Docker. It’s easy to use, and its long-standing community means we can find solutions to most issues we encounter during builds or at runtime. The isolation containers provide also gives our application an extra layer of security.
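As a hedged sketch, a serving image can be as small as the Dockerfile below. The filenames (`requirements.txt`, `serve.py`) are assumptions for illustration, not a prescribed layout.

```dockerfile
# Illustrative serving image for a stateless ML inference app.
FROM python:3.11-slim
WORKDIR /app

# Pin dependencies so upstream changes can't break the deployment.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake the code and model artifacts into the image; write nothing locally.
COPY . .

EXPOSE 8080
CMD ["python", "serve.py"]
```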

5. Create Versions

Both the model and the surrounding code in the container must be versioned in a standardized way, so that we can track not only progress but also any breaking changes introduced by subsequent upgrades.

Some of the popular versioning tools are:

  • Git
  • DVC
  • Pachyderm
  • MLMD (ML Metadata)

6. Tracking Mechanism

A consistent monitoring mechanism is important in ensuring our models stay relevant and consistent once deployed in a production environment. A simple implementation is a cron job that measures the deployed model’s accuracy by holding out a fraction of the incoming dataset and testing/benchmarking against it.
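The holdout idea can be sketched as below. The `model` and the labelled stream are placeholders; in practice the labels would come from delayed ground truth, and the job would run on a schedule.

```python
# Sketch of scheduled accuracy monitoring: benchmark the deployed model
# on a random fraction of labelled incoming traffic.
import random

def model(x):
    return x >= 0  # placeholder classifier

def holdout_accuracy(labelled_stream, fraction=0.1, seed=0):
    """Measure accuracy on a held-out fraction of (input, label) pairs."""
    rng = random.Random(seed)
    sample = [(x, y) for x, y in labelled_stream if rng.random() < fraction]
    if not sample:
        return None  # nothing sampled this run
    correct = sum(1 for x, y in sample if model(x) == y)
    return correct / len(sample)
```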

It’s also important to route the requests to a cache for some time and keep track of feature importance changes, as illustrated below.

(Figure: feature importance changes. Source: Deepchecks)

Developing, testing, and deploying Machine Learning in production on the fly ensures the required consistency in the behavior of our deployment.


The production deployment of the model is the ultimate test of any data science project, and should be done keeping the user experience in mind, like any other application we create. We should strive to ensure our deployment has low latency, consistent responses to the same data, exception handling, drift monitoring, and automated upgrades (retraining triggers). The image below demonstrates how automated triggers for retraining models help us maintain the quality of the models over time.

Source: Deepchecks
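Such a retraining trigger can be sketched in a few lines. The threshold and the `retrain` callable are assumptions for illustration; in a real pipeline the trigger would kick off a CI job or workflow run.

```python
# Sketch of an automated retraining trigger: when monitored accuracy
# degrades past a threshold, fire the retraining job.
def maybe_retrain(current_accuracy, threshold=0.90,
                  retrain=lambda: "retrain-started"):
    """Kick off retraining when model quality drops below the threshold."""
    if current_accuracy < threshold:
        return retrain()        # e.g. launch a training pipeline run
    return "model-healthy"      # no action needed this cycle
```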

These considerations and our checklist, when properly followed, will help deliver a successful deployment capable of delighting customers. In the words of Godfrey Reggio, “It’s not that we use technology, we live technology.”

