
How to Validate Your ML Model Before Deploying into Production

Introduction

You just finished evaluating your latest ML model, hooray! You achieved 95% accuracy. Time to deploy the latest model!

Not exactly.

This blog post aims to help ML engineers understand possible pitfalls when deploying ML models into production. We divide the topics into two main categories: data-related issues and model-related issues.

Know Your Data

Importance of Using a Hold-out Set

Splitting your dataset into train and test sets is essential for evaluating how your model will behave in the real world. If we rely only on our model’s results on the training data, we are likely to overfit and “memorize” the examples, which leads to poor performance on new data.

It is important to note that this split can be compromised by any information you receive from the test set. Even just evaluating your model on the test set can break this ideal separation if the results affect your decision-making process.

Best practices: Use the test set as little as possible; use a validation set for hyperparameter tuning and model selection.

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set, then split the remainder into
# 75% train / 25% validation (i.e., 60% / 20% / 20% of the full dataset).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

Code example: split your data into train, validation, and test sets, and use the test set only for the final evaluation

Detecting Data Leakage

Separating samples into training and test sets is not always enough. Duplicate samples can end up in both sets for various reasons, and it is important to detect and remove them. For example, CIFAR, one of the most widely used datasets in Computer Vision, has been shown to contain duplicate samples across its training and test sets.

This becomes more of a problem for models that are trained on huge amounts of data such as BERT and GPT-3.

Best practices: Remove duplicates before splitting the data, check for partial duplicates as well, sort by different columns, and examine the data.
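
As a minimal sketch of such a check (the DataFrame and column names here are hypothetical), exact duplicates can be dropped before splitting, and partial duplicates surfaced by normalizing a column and inspecting the rows that collide:

import pandas as pd

# Hypothetical dataset stored in a pandas DataFrame.
df = pd.DataFrame({
    "text": ["good product", "bad product", "good product", "Good product "],
    "label": [1, 0, 1, 1],
})

# Drop exact duplicates before splitting into train/validation/test.
df = df.drop_duplicates()

# Look for partial duplicates: normalize a column, then inspect rows that collide.
df["text_normalized"] = df["text"].str.lower().str.strip()
suspects = df[df.duplicated(subset="text_normalized", keep=False)]
print(suspects.sort_values("text_normalized"))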

Understanding the Makeup of Your Data

On which examples does your model fail? Which examples are easier to predict?

To understand the makeup of our data, use a clustering algorithm such as K-Means and run your model on samples from different clusters. Visualizing the clusters using a 2D projection such as PCA or t-SNE can be very helpful. This process is typically done during data exploration, before the model development stage, but once we have a model we can also investigate how the makeup of our data relates to our model’s predictions.

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...              [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])

Code example: using clustering can help you understand the makeup of your data, and the types of mistakes your model makes

t-SNE visualization of k-means clustering of the MNIST dataset
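
For the projection itself, here is a minimal sketch using scikit-learn’s TSNE on synthetic data (the data and cluster count are illustrative, not tied to a specific dataset):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Illustrative data: 200 samples with 50 features.
rng = np.random.RandomState(0)
X = rng.rand(200, 50)

# Cluster the samples, then project them to 2D for visualization.
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.title("t-SNE projection colored by cluster")
plt.show()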

Data Discrepancy Between Development and Production Settings

Before deploying your ML model, it is essential to evaluate its performance on real-world data. If your model is trained on a clean dataset, it is important to generate a dataset that simulates real-world data as closely as possible and evaluate your model on it. This is especially important when the dataset does not come from exactly the same source as the production data.

Best practices: Compare the structure of an actual real-world data point with your training dataset and make sure the structure is identical (e.g., column names, image size, sentence representation), and compare the distributions of the different features to detect discrepancies.
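
A minimal sketch of such checks, assuming tabular data held in two pandas DataFrames, a training set and a sample of production data (both hypothetical here):

import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical training data and a sample of production data.
train_df = pd.DataFrame({"size_sqft": [700, 850, 1200, 1500], "rooms": [2, 3, 4, 5]})
prod_df = pd.DataFrame({"size_sqft": [690, 900, 2500, 1400], "rooms": [2, 3, 6, 4]})

# 1. Structure: same columns and dtypes?
assert list(train_df.columns) == list(prod_df.columns), "column mismatch"
assert (train_df.dtypes == prod_df.dtypes).all(), "dtype mismatch"

# 2. Distributions: a two-sample Kolmogorov-Smirnov test per numeric feature.
for col in train_df.select_dtypes("number").columns:
    stat, p_value = ks_2samp(train_df[col], prod_df[col])
    print(f"{col}: KS statistic={stat:.2f}, p-value={p_value:.2f}")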

Check for Data Drift in Historical Data

Data drift is one of the top reasons model accuracy decreases over time. It means that the data distribution changes over time and no longer matches the distribution of the training data. This is often caused by changes in the real world, such as new products in the market or a change of season. It is something we cannot be fully prepared for, so we recommend constantly monitoring production data to detect it.

One way to prepare is to look for data drift in historical data. Take data points from one year ago and compare their feature distributions with current data. If there is a significant change in the distributions, you have data drift and are likely to experience similar drift in the future.

Additionally, you can train your model on historical data and then evaluate it on current data. If the results are significantly worse than the results on the historical data, you are probably experiencing data drift or concept drift.
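
A minimal sketch of this second check, assuming a tabular dataset with a timestamp column (the file name, feature columns, and target below are hypothetical):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical spam dataset with a timestamp column.
df = pd.read_csv("messages.csv", parse_dates=["created_at"])
features, target = ["num_links", "num_words"], "is_spam"

historical = df[df["created_at"] < "2021-01-01"]
current = df[df["created_at"] >= "2021-01-01"]

# Train on historical data, keeping a held-out historical test set as a fair baseline.
X_tr, X_te, y_tr, y_te = train_test_split(
    historical[features], historical[target], test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# A large gap between the two scores suggests data drift or concept drift.
print("historical test accuracy:", accuracy_score(y_te, model.predict(X_te)))
print("current data accuracy:", accuracy_score(current[target], model.predict(current[features])))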

Data drift example: after an app update it can be easier to send automatic spam messages, and thus a higher proportion of messages will be spam (source)


Check for a Strong Correlation Between Features and Target

If your results seem too good to be true, you might be right.

When testing ML applications, we want to make sure our models are not “cheating.” An example is when some of the features have a higher correlation with the target than they should.

Imagine that you are trying to predict property price given all sorts of parameters such as size, zip code, proximity to public transit, and location. Now, what if one of the fields in the training data is price/sqft? Of course, if we know the size and the price/sqft of the property, we don’t actually need an ML model to solve our problem.

One way to know if we are dealing with such a case is by looking at the correlation matrix between the different features and the target. If there are any features that have an extremely high correlation with the target, something might be fishy.

Plotting the covariance matrix as a heatmap can help understand whether some features are highly correlated (source)
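
A minimal sketch of this check with pandas (the column names and values below are hypothetical, with the leaky price/sqft column included on purpose):

import pandas as pd

# Hypothetical property-price dataset; price_per_sqft leaks the target.
df = pd.DataFrame({
    "size_sqft": [700, 850, 1200, 1500],
    "dist_to_transit_km": [1.2, 0.4, 3.0, 2.1],
    "price_per_sqft": [1000, 1200, 900, 1100],
    "price": [700_000, 1_020_000, 1_080_000, 1_650_000],
})

# Correlation of every feature with the target, sorted by absolute strength.
corr_with_target = df.corr()["price"].drop("price")
print(corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index))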

Another thing to check is whether some features can be predicted from the other features with high accuracy. This can reveal more complex relationships between features that we might have missed otherwise.
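
One way to sketch this check is to fit a simple model that predicts each feature from the remaining features and look at its cross-validated R² (the DataFrame here is synthetic and purely illustrative):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic numeric features; f4 is almost fully determined by f1.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(200, 4), columns=["f1", "f2", "f3", "f4"])
df["f4"] = df["f1"] * 2 + 0.01 * rng.rand(200)

# Try to predict each feature from all the others.
for col in df.columns:
    X, y = df.drop(columns=col), df[col]
    score = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=3, scoring="r2").mean()
    print(f"{col}: mean cross-validated R^2 = {score:.2f}")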

Understand Your Model

The accuracy score on the test set is not everything. We would want our model to be robust, calibrated, and understandable to minimize unwanted surprises and get good results in production.

Model Calibration

In classification problems, our model normally predicts a probability per class. We then select the class with the highest probability, or in the case of binary classification, we use a threshold to determine the predicted label. But what exactly are these probability values, and can we actually trust them?

Model calibration is the process of updating your model so that not only the final predictions are accurate, but the probability estimates are as well. This can be done by comparing the model’s predictions with the observed frequency of each label conditioned on the features.

Code example:

>>> import numpy as np
>>> from sklearn.calibration import calibration_curve
>>> y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
>>> y_pred = np.array([0.1, 0.2, 0.3, 0.4, 0.65, 0.7, 0.8, 0.9,  1.])
>>> prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=3)
>>> prob_true
array([0. , 0.5, 1. ])
>>> prob_pred
array([0.2  , 0.525, 0.85 ])

A calibrated model is one where the predicted probability reflects the actual probability that an example will be classified as “true”; thus the curve y=x represents perfect calibration (source)
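
If the calibration curve reveals miscalibration, one common remedy in scikit-learn is to wrap the classifier with CalibratedClassifierCV; a minimal sketch on synthetic data (the base model and parameters are illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap an uncalibrated classifier with Platt scaling ("sigmoid") and 3-fold CV.
base = LinearSVC(C=1.0, dual=False)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3).fit(X_train, y_train)

# predict_proba now returns calibrated probability estimates.
print(calibrated.predict_proba(X_test)[:3])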

What Makes a Robust Model?

The winning submission in a Kaggle competition is not necessarily the model best suited for running in production. This is because when models are overly complex and trained specifically to optimize a single metric, they tend not to generalize well. Sometimes, it’s worth sacrificing a small percentage of accuracy to use a model that we believe is more solid.

So how do we make our model simpler and more resilient without sacrificing accuracy?

Reducing the dimensionality of the problem is essential to simplifying our model. We can use feature selection to keep only the features with the most impact on prediction results. We lose some information, but we gain a simpler, more interpretable model.

>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> X, y = load_iris(return_X_y=True)
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)

Code example: Dimensionality of X is reduced from 4 to 3 using sklearn’s feature selection.

Explain Yourself!

One of the hot topics in ML and DL is explainability. If humans are meant to trust AI systems, we need to have some access to the decision-making process and be able to monitor and control it.

When deploying a model into production, it is extremely valuable for us to be able to examine the decision process of our model on a given input. When you have a clear idea of why your model makes correct and incorrect predictions, you’ll be more prepared for a potential decrease in accuracy in production, and you will have an easier time detecting the issue.

Libraries such as eli5 or SHAP can highlight the parts of a text that were most relevant to the prediction (source)
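
As a minimal sketch with the shap library (assuming it is installed; the model and dataset here are illustrative, using scikit-learn’s built-in breast cancer data):

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Train a simple model on a built-in dataset.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

# Compute SHAP values: how much each feature pushed each prediction up or down.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:5])
print(shap_values)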

Final Remarks

We have covered many common pitfalls you may come across when deploying ML models to production, along with suggestions for ensuring your model performs well in the production setting. We have seen that it is important not only to look at the metrics that reflect our model’s performance, but also to delve into the story behind them.
