
Benefits of MLOps Tools for ML Data

This blog post was written by Preet Sanghavi as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that’s accepted by our reviewers.


Just like we need the right blend of spices to prepare a savory meal, one needs to follow a set of key steps for an end-to-end Machine Learning project:

  1. Collect data and prepare an Artifact Store. Artifacts stored here can be data, metadata, and other objects.
  2. Clean and preprocess the collected data.
  3. Write and track the code, the data, and the versions of the different models involved.
  4. Prepare the production and staging environments and execute the continuous integration and continuous delivery (CI/CD) process.
  5. Deploy the model using an Application Programming Interface (API), or embed it in a web or mobile application.
  6. Monitor the model after deploying it to production.
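As a toy, hedged illustration of steps 1-3 (scikit-learn, the Iris dataset, and the pickle/JSON file layout are all assumptions made for this sketch, not a prescribed stack), a model can be trained, evaluated, and versioned into a local directory standing in for an artifact store:

```python
import json
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 1-2: collect and prepare the data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train and evaluate a model
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 3: persist the model plus its parameters and metrics,
# versioned side by side in a stand-in artifact store
artifact_store = Path(tempfile.mkdtemp())
with open(artifact_store / "model_v1.pkl", "wb") as f:
    pickle.dump(model, f)
with open(artifact_store / "model_v1.json", "w") as f:
    json.dump({"version": 1, "params": model.get_params(),
               "accuracy": accuracy}, f, default=str)
```

A real project swaps the local directory for a dedicated artifact store, but the idea is the same: the metadata saved next to the model is what makes a run reproducible later.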

In order to successfully execute the aforementioned steps, appropriate MLOps tools should be selected to simplify and facilitate the development path.

We have modularized important stages in ML projects and have explored different tools that can help in completing them.

Data Management and Storage Tools

Firstly, highly dependable storage space is required to keep records of the models trained, the data used, the values of parameters and hyperparameters of the model, evaluation metrics, and much more.

Here are some popular options to complete those tasks:

  1. MLflow
  2. Comet
  3. Verta AI
  4. Neptune

While there are some differences in how these tools work, most of them provide a similar set of features. Let us explore data management and storage using MLflow.

MLflow can be adopted by teams of any size and is library-agnostic: it can be used with any programming language and with any Machine Learning library.

The four primary components of MLflow

  • MLflow Tracking. Helps track changes associated with parameters, code, and metrics.
  • MLflow Projects. Provides support for running projects locally or remotely.
  • MLflow Models. A standardized format for packaging models for deployment.
  • MLflow Model Registry. Helps manage the entire life cycle of an MLflow Model.

First, install the necessary library (MLflow) with the command below. You can read more about its installation here.

pip install mlflow

In the following code, a Random Forest Regressor is instantiated, its parameters and metrics are logged, and the model is registered as version 1.

from random import random, randint
from sklearn.ensemble import RandomForestRegressor
 
import mlflow
import mlflow.sklearn
 
with mlflow.start_run(run_name="YOUR_RUN_NAME") as run:
    params = {"n_estimators": 5, "random_state": 42}
    sk_learn_rfr = RandomForestRegressor(**params)
 
    # Log parameters and metrics using the MLflow APIs
    mlflow.log_params(params)
    mlflow.log_param("param_1", randint(0, 100))
    mlflow.log_metrics({"metric_1": random(), "metric_2": random() + 1})
 
    # Log the sklearn model and register as version 1
    mlflow.sklearn.log_model(
        sk_model=sk_learn_rfr,
        artifact_path="sklearn-model",
        registered_model_name="sk-learn-random-forest-reg-model"
    )

Python code for registering a model using MLflow

While trying to pick a relevant tool, do weigh factors such as integration flexibility, ease of use, and scalability against your requirements.


Versioning Tools

Versioning improves the overall reproducibility of models. It is important to version our data, trained models, parameters, and hyperparameters.

While trying to figure out which versioning tool to use, weigh similar criteria.

Some of the open-source options available to get this done are:

a. Data Version Control (DVC)

This open-source platform not only allows users to version their data but also helps perform tracking functions. Additionally, DVC allows sharing Machine Learning pipelines with different members of the team.

b. Git Large File Storage (Git LFS)

Git LFS is particularly useful for working with large files such as audio or video: it stores lightweight pointers within Git while keeping the file contents on a separate server, so the files can be fetched and edited quickly and efficiently.

git lfs install
git lfs track "*.psd"
git add .gitattributes
git add file.psd
git commit -m "Add design file"
git push origin main

Git commands to track a file using Git LFS


Model Tuning Tools

Once the model is trained, we look for optimization techniques to increase its overall accuracy. Hyperparameter tuning is an important part of this process, alongside data cleaning, data analysis, feature extraction, and output evaluation. The tools that help us automate model tuning include:

a. Optuna

Optuna is an open-source framework that can be used to optimize Machine Learning and Deep Learning models: it finds optimal hyperparameters through a define-by-run API in which search spaces are expressed with ordinary Python constructs such as loops and conditionals. It can be integrated with widely used Python modules like Scikit-learn, XGBoost, and PyTorch, amongst others.

The Optuna module can be installed with the following command:

pip install optuna

This code iteratively optimizes the parameter of the equation (x - 2) ** 2 over 100 trials:

import optuna
 
def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2
 
study = optuna.create_study()
study.optimize(objective, n_trials=100)
 
study.best_params  # E.g. {'x': 2.002108042}

Python code to optimize parameters (‘x’ in this case) using Optuna

b. Hyperopt

Defining the objective function, setting up the search space, and minimizing the objective over the search space are the key steps employed by Hyperopt. Python’s Hyperopt package allows for serial and parallel optimization over challenging spaces, including those with conditional, discrete, and real-valued dimensions.

The Hyperopt module can be installed with the help of the following command:

pip install hyperopt

The following code minimizes the function x ** 2 using the Hyperopt functions fmin, tpe, and hp.

from hyperopt import fmin, tpe, hp
 
best = fmin(fn=lambda x: x ** 2,
    space=hp.uniform('x', -10, 10),
    algo=tpe.suggest,
    max_evals=100)
 
print(best)

Python code for the objective function using Hyperopt
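For intuition about what these optimizers automate, it helps to compare them with the naive baseline they improve upon: plain random search over the same space. A minimal, standard-library-only sketch mirroring the x ** 2 objective above:

```python
import random

def objective(x):
    return x ** 2

random.seed(0)
# Draw 100 candidates uniformly from [-10, 10] and keep the best one
candidates = [random.uniform(-10, 10) for _ in range(100)]
best_x = min(candidates, key=objective)
print(best_x)
```

Unlike this baseline, TPE (the algorithm behind tpe.suggest) uses the outcomes of past trials to concentrate sampling in promising regions of the search space.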

Optimization functions, search space, usage efficiency, and exception handling are some of the important factors that should be considered when choosing a model tuning tool for your Machine Learning model.

Model Deployment Tools

Before beginning the model’s construction, it is crucial to understand the type of deployment to be implemented. In addition, issues such as the business use case, organizational scale, and resource availability (like GPUs or data storage capacity) must be considered when adopting a model.

There are a number of options to choose from when it comes to ML deployment platforms.

a. BentoML

BentoML is a Python-first, open-source platform for model deployment. BentoML integrates with your existing data stack, letting your ML team regularly deploy and improve models in production.

The following command can be used to install the BentoML module.

pip install bentoml

This code can be used to save an SVC model trained on the Iris dataset to the BentoML local model store:

import bentoml

from sklearn import svm
from sklearn import datasets

# Load training data set
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Train the model
clf = svm.SVC(gamma='scale')
clf.fit(X, y)


# Save model to the BentoML local model store
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
print(f"Model saved: {saved_model}")

Python code to save a model using BentoML

Here is an illustration of the BentoML user interface to deploy a model to production:

BentoML user interface to deploy a model to production

Source: Web UI for Deployment

Refer to this Colab notebook to get an in-depth understanding of working with BentoML.

b. Cortex

Similar to BentoML, Cortex is a highly effective open-source platform that allows the deployment of any type of model. Cortex also lets us monitor model metrics after the model is in production.

Check out this in-depth introduction to Cortex for a better understanding of Cortex.

Production Monitoring Tools

Once the model has been successfully deployed, the next step is to monitor the model in a production environment. Monitoring a model helps us find any change in data or relationship between the input and the target variable.
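The data drift that such monitoring catches can be illustrated with a simple two-sample test. The sketch below (NumPy and SciPy assumed; a production monitoring tool would aggregate checks like this across all features and over time) uses the Kolmogorov-Smirnov test to flag a feature whose live distribution has shifted away from the training distribution:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Reference distribution of one feature, captured at training time
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)

# Live traffic: initially the same distribution, later shifted
live_ok = rng.normal(loc=0.0, scale=1.0, size=1000)
live_drifted = rng.normal(loc=0.5, scale=1.0, size=1000)

def has_drift(reference, live, alpha=0.001):
    # A small p-value means the two samples are unlikely to share
    # a distribution, so we raise a drift alert
    return ks_2samp(reference, live).pvalue < alpha

print(has_drift(train_feature, live_ok))
print(has_drift(train_feature, live_drifted))
```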

One should take into consideration the following criteria while trying to pick the right tool for monitoring.

  • Integration flexibility
  • Ease of monitoring
  • Concept and data drift alerts

Here are some of the most widely used tools for Production Monitoring:

  1. AWS SageMaker
  2. SeldonCore
  3. Evidently
  4. Hydrosphere

 

Conclusion

We have gone through Data Management & Storage, Versioning, Model Tuning, Model Deployment, and Production Monitoring tools. These make it easier to scale and tune our models and ready them for deployment, and they help us trace back a model’s behavior if it turns out to be unexpected. MLOps tools have proved to be a boon for data scientists, Machine Learning engineers, mathematicians, and researchers working with Machine Learning data. One can compare and perform an in-depth analysis of the different MLOps tools and platforms suitable for your requirements here.

To explore Deepchecks’ open-source library, go try it out yourself! Don’t forget to ⭐ their GitHub repo, it’s really a big deal for open-source-led companies like Deepchecks.
