You just finished evaluating your latest ML model, and hooray! You achieved 95% accuracy. Time to deploy the latest model!
This blog post aims to help ML engineers understand possible pitfalls when deploying ML models into production. We divide the topics into two main categories: data-related issues and model-related issues.
Know your data
Importance of using a holdout set
Splitting your dataset into train and test data is essential for evaluating how your model will behave in the real world. If we rely only on the model’s results on the training data, we cannot tell whether it has learned patterns that generalize or has simply overfit and “memorized” the training examples, in which case it will perform poorly on new data.
Note that this separation is broken by any information that flows from the test set into your decisions. Even evaluating your model on the test set breaks the ideal separation if the results affect your decision-making process.
Best practices: Use the test set as little as possible, and use a separate validation set for hyperparameter tuning and model selection.
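    from sklearn.model_selection import train_test_split

    # Hold out 20% of the data as the final test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Split the remaining 80% again: 0.25 * 0.8 = 0.2, giving a 60/20/20 train/validation/test split.
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)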
Code example: split your data into train, validation, and test sets, and use the test set only for final evaluation
Detecting data leakage
Separating samples into training and test sets is not always enough. Duplicate samples can end up in both sets for various reasons, and it is important to detect and remove them. For example, CIFAR, one of the most used datasets in Computer Vision, has been shown to contain duplicate samples across its training and test sets.
This becomes more of a problem for models such as BERT and GPT-3, which are trained on amounts of data too large to inspect by hand.
Best practices: Remove duplicates before splitting the data, check for partial duplicates as well, sort by different columns, and examine the data.
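The check for full and partial duplicates across splits can be sketched with pandas. This is a minimal illustration, not a complete deduplication pipeline: the helper name `leaked_test_rows`, the column names, and the toy data are all made up for the example.

```python
import pandas as pd


def leaked_test_rows(train_df, test_df, columns):
    """Return the rows of test_df whose values in `columns` also occur in train_df."""
    # Build a set of key tuples from the training data for fast lookup.
    train_keys = set(train_df[columns].itertuples(index=False, name=None))
    mask = pd.Series(
        [key in train_keys for key in test_df[columns].itertuples(index=False, name=None)],
        index=test_df.index,
        dtype=bool,
    )
    return test_df[mask]


# Toy data (made up for illustration).
train = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"], "label": [0, 1, 0]})
test = pd.DataFrame({"id": [4, 5], "text": ["b", "d"], "label": [1, 0]})

exact = leaked_test_rows(train, test, ["id", "text", "label"])  # full-row duplicates
partial = leaked_test_rows(train, test, ["text", "label"])      # ignores the id column
print(len(exact), partial["id"].tolist())  # prints: 0 [4]
```

Comparing on a subset of columns is what surfaces the partial duplicate here: the test row with id 4 has the same text and label as a training row, but a full-row comparison misses it because the ids differ.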
Understanding the makeup of your data
On which types of examples does your model fail? Which examples are easier to predict?
To understand the makeup of your data, use a clustering algorithm such as K-Means and evaluate your model separately on samples from each cluster. Visualizing the clusters with a 2D projection such as PCA or t-SNE can also be very helpful. Clustering is normally part of data exploration, before the model development stage, but once the model exists you can go back and investigate how the makeup of your data relates to the model’s predictions.
    >>> from sklearn.cluster import KMeans
    >>> import numpy as np
    >>> X = np.array([[1, 2], [1, 4], [1, 0],
    ...               [10, 2], [10, 4], [10, 0]])
    >>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
    >>> kmeans.labels_
    array([1, 1, 1, 0, 0, 0], dtype=int32)
    >>> kmeans.predict([[0, 0], [12, 3]])
    array([1, 0], dtype=int32)
    >>> kmeans.cluster_centers_
    array([[10.,  2.],
           [ 1.,  2.]])
Code example: using clustering can help you understand the makeup of your data, and the types of mistakes your model makes
t-SNE visualization of k-means clustering of the MNIST dataset
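Once every sample has a cluster label, relating clusters to model mistakes comes down to computing a metric per cluster. The sketch below assumes you already have true labels, model predictions, and cluster assignments (for example from `kmeans.predict` on your evaluation set); the helper name `accuracy_per_cluster` and the toy arrays are hypothetical.

```python
import numpy as np


def accuracy_per_cluster(y_true, y_pred, cluster_labels):
    """Map each cluster id to (number of samples, accuracy) within that cluster."""
    y_true, y_pred, cluster_labels = map(np.asarray, (y_true, y_pred, cluster_labels))
    report = {}
    for cluster in np.unique(cluster_labels):
        mask = cluster_labels == cluster
        accuracy = float(np.mean(y_true[mask] == y_pred[mask]))
        report[int(cluster)] = (int(mask.sum()), accuracy)
    return report


# Toy values (made up): the model is perfect on cluster 0 and weak on cluster 1.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0]
clusters = [0, 0, 0, 1, 1, 1]

report = accuracy_per_cluster(y_true, y_pred, clusters)
for cluster, (count, accuracy) in report.items():
    print(f"cluster {cluster}: n={count}, accuracy={accuracy:.2f}")
```

A cluster with noticeably lower accuracy than the rest is a good place to start looking at individual examples.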
Data discrepancy between development and production settings
Before deploying your machine learning model, it is essential to evaluate the performance of your model on real-world data. If your model is trained on a clean dataset, it is important to generate a dataset that simulates real-world data as closely as possible and evaluate your model on this. This is especially important when the dataset does not come from exactly the same source as the production data.
Compare the structure of an actual real-world data point with your training dataset and make sure it is identical (e.g. column names, image size, sentence representation). Then compare the distributions of individual features to detect discrepancies.
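For tabular data, the structural comparison can be automated. The following is a rough sketch that assumes both datasets are pandas DataFrames; the function name `schema_mismatches` and the example columns are invented for illustration, and a real check would likely also cover value ranges and missing-value rates.

```python
import pandas as pd


def schema_mismatches(train_df, prod_df):
    """List human-readable differences in columns and dtypes between two frames."""
    issues = []
    for col in train_df.columns.difference(prod_df.columns):
        issues.append(f"missing in production: {col}")
    for col in prod_df.columns.difference(train_df.columns):
        issues.append(f"unexpected in production: {col}")
    for col in train_df.columns.intersection(prod_df.columns):
        if train_df[col].dtype != prod_df[col].dtype:
            issues.append(
                f"dtype mismatch for {col}: {train_df[col].dtype} vs {prod_df[col].dtype}"
            )
    return issues


# Toy frames (made up): zip is a string in training but an integer in production.
train = pd.DataFrame({"size": [50.0, 80.0], "zip": ["10001", "94103"], "rooms": [2, 3]})
prod = pd.DataFrame({"size": [65.0], "zip": [10001], "floor": [3]})

issues = schema_mismatches(train, prod)
for issue in issues:
    print(issue)
```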
Check for data drift in historical data
Data drift is one of the top reasons model accuracy decreases over time. It essentially means that the data distribution changes with time, and does not match the distribution of the training data. This is often caused by changes in the real world, such as new products in the market, change of season, and so on. This is something that one cannot be totally prepared for, and we recommend constant monitoring of production data to detect this.
One way to prepare is to look for data drift in historical data. Take data points from one year ago and compare their feature distributions with current data. If there is a significant change in the distribution, you have data drift and are likely to experience similar drifts in the future.
Additionally, you can train your model on historical data and then evaluate it on current data. If the results are significantly different from the results on the historical data, you are probably experiencing data drift or concept drift (a change in the relationship between the features and the target).
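One possible way to compare a numeric feature's distribution across two time periods is a two-sample Kolmogorov-Smirnov test, available as `scipy.stats.ks_2samp`. The sketch below is an assumption about how such a check could look rather than a prescribed method: the helper name `drifted_features`, the significance level `alpha=0.01`, and the synthetic data are all illustrative choices.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(reference, current, feature_names, alpha=0.01):
    """Return {feature: p_value} for features whose KS test rejects 'same distribution'."""
    drifted = {}
    for i, name in enumerate(feature_names):
        result = ks_2samp(reference[:, i], current[:, i])
        if result.pvalue < alpha:
            drifted[name] = float(result.pvalue)
    return drifted


# Synthetic data (made up): the second feature's mean shifts by one standard deviation.
rng = np.random.default_rng(0)
last_year = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
this_year = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=1000),  # unchanged feature
    rng.normal(loc=1.0, scale=1.0, size=1000),  # shifted feature
])

drift_report = drifted_features(last_year, this_year, ["feature_a", "feature_b"])
print(drift_report)
```

With many features, some will fall below the threshold by chance, so treat flagged features as candidates for a closer look rather than as proof of drift.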
Data drift example: after an app update it can be easier to send automatic spam messages, and thus a higher proportion of messages will be spam (source)
Check for a strong correlation between features and target
If your results seem too good to be true, they might be.
When testing machine learning applications, we want to make sure our models are not “cheating”. An example of such a scenario is when some of the features have a higher correlation with the target than they should.
Imagine, for instance, that you are trying to predict a property’s price from parameters such as size, zip code, proximity to public transit, and location. Now, what if one of the fields in the training data is price/sqft? Of course, if we know the size and the price/sqft of the property, we don’t actually need an ML model to solve our problem.
One way to detect if we are dealing with such a case is to look at the correlation matrix between the different features and the target. If there are any features that have an extremely high correlation with the target, something might be fishy.
Plotting the correlation matrix as a heatmap can help you see whether some features are highly correlated (source)
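The same check can be done numerically by listing features whose correlation with the target is above some threshold. This is a sketch under assumptions: the helper name `suspicious_features`, the threshold of 0.95, and the synthetic property data (including the leaky `price_in_thousands` column) are made up for illustration. Pearson correlation only captures linear relationships, so a clean result here does not rule out leakage.

```python
import numpy as np
import pandas as pd


def suspicious_features(df, target, threshold=0.95):
    """Return features whose absolute Pearson correlation with the target exceeds threshold."""
    correlations = df.corr(numeric_only=True)[target].drop(target).abs()
    return correlations[correlations > threshold].sort_values(ascending=False)


# Synthetic property data (made up for illustration).
rng = np.random.default_rng(0)
size = rng.uniform(40, 200, 500)
transit_distance = rng.uniform(0, 5, 500)
price = 3000 * size - 20000 * transit_distance + rng.normal(0, 100000, 500)

df = pd.DataFrame({
    "size": size,
    "transit_distance": transit_distance,
    "price_in_thousands": np.round(price / 1000),  # leaky: derived from the target
    "price": price,
})

flagged = suspicious_features(df, "price")
print(flagged)
```

Here `size` is a legitimately strong predictor but stays below the threshold, while the column derived from the target is flagged.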
Another thing to check is whether we can predict some features from the other features with high accuracy. This enables us to detect more complex relationships between features that we might have missed otherwise.
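One way to try this is to fit a small model for each feature using all the other features as inputs and look at the cross-validated score. The sketch below uses scikit-learn's `LinearRegression` purely for brevity, so it only detects linear dependencies; swapping in a nonlinear regressor would catch more. The helper name `feature_predictability` and the synthetic features are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def feature_predictability(X, feature_names, cv=5):
    """Cross-validated R^2 for predicting each feature from all the others."""
    scores = {}
    for i, name in enumerate(feature_names):
        others = np.delete(X, i, axis=1)
        r2 = cross_val_score(LinearRegression(), others, X[:, i], cv=cv, scoring="r2")
        scores[name] = float(r2.mean())
    return scores


# Synthetic features (made up): c is almost fully determined by a and b, d is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = 2 * a - b + rng.normal(scale=0.01, size=300)
d = rng.normal(size=300)
X = np.column_stack([a, b, c, d])

scores = feature_predictability(X, ["a", "b", "c", "d"])
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Note that `a` and `b` also score highly, since each can be recovered from `c` and the other one; only `d` is genuinely independent of the rest.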
Understand your model
The accuracy score on the test set is not everything. We would like our model to be robust, calibrated, and understandable so that we can minimize unwanted surprises and get good results in production.
In classification problems, our model normally predicts a probability per class. We then select the class with the highest probability, or in the case of binary classification, we use a threshold to determine the predicted label. But what exactly are these probability values, and can we actually trust them?
Model calibration is the process of adjusting your model so that not only its final predictions are accurate, but its probability estimates are accurate as well. To check calibration, compare the model’s predicted probabilities with how often each label actually occurs among the examples that received those probabilities.
    >>> import numpy as np
    >>> from sklearn.calibration import calibration_curve
    >>> y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
    >>> y_pred = np.array([0.1, 0.2, 0.3, 0.4, 0.65, 0.7, 0.8, 0.9, 1.])
    >>> prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=3)
    >>> prob_true
    array([0. , 0.5, 1. ])
    >>> prob_pred
    array([0.2  , 0.525, 0.85 ])
A calibrated model is one where the predicted probability reflects the actual frequency with which an example belongs to the positive class; thus the curve y=x is perfectly calibrated (source)
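Beyond measuring calibration, scikit-learn offers `CalibratedClassifierCV` for adjusting a classifier's probabilities. The sketch below is one possible setup, not a recommendation: the synthetic dataset, the choice of `GaussianNB` as the base model, and the isotonic method are all assumptions made for the example. It compares Brier scores (mean squared error of the predicted probabilities, lower is better) before and after calibration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features, which tends to make naive Bayes overconfident.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=2, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Uncalibrated baseline.
raw = GaussianNB().fit(X_train, y_train)

# Same model family, with probabilities calibrated via cross-validation on the training set.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

raw_brier = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])
calibrated_brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
print(f"Brier score, raw: {raw_brier:.3f}, calibrated: {calibrated_brier:.3f}")
```

On data like this the calibrated score would usually be expected to be lower, but that is not guaranteed, so it is worth checking on your own held-out data.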
What makes a robust model
The winning submission in a Kaggle competition is often not the model best suited to running in production. Why? Because when models are overly complex and trained specifically to optimize a single metric, they tend to generalize less well. Sometimes it’s worth sacrificing a small percentage of accuracy in order to use a model that we believe to be more solid.
So how do we make our model simpler and more resilient without sacrificing accuracy?
Reducing the dimensionality of the problem is essential in simplifying our model. We can use feature selection to keep only the features with the most impact on the prediction results. We lose some information, but we gain a simpler, more interpretable model.
    >>> from sklearn.svm import LinearSVC
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.feature_selection import SelectFromModel
    >>> X, y = load_iris(return_X_y=True)
    >>> X.shape
    (150, 4)
    >>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
    >>> model = SelectFromModel(lsvc, prefit=True)
    >>> X_new = model.transform(X)
    >>> X_new.shape
    (150, 3)
Code example: Dimensionality of X is reduced from 4 to 3 using sklearn’s feature selection.
One of the hot topics in ML and DL currently is explainability. If humans are meant to trust AI systems, we need to have some access to the decision-making process and be able to monitor and control it.
When deploying a model into production, it is extremely valuable for us to be able to examine the decision process of our model on a given input. When you have a clear idea of why your model makes correct and incorrect predictions, you’ll be more prepared for a potential decrease in accuracy in production, and you will have an easier time detecting the issue.
Libraries such as eli5 or SHAP can highlight the parts of a text that are most relevant to the prediction (source)
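If you want a first look at what drives a model without installing an extra library, scikit-learn's `permutation_importance` is one model-agnostic option. Note that this is a different technique from the per-example explanations that eli5 and SHAP produce: it measures how much the test score drops when a feature's values are shuffled. The dataset and model below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and record the average drop in test accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

ranking = sorted(
    zip(data.feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Features whose shuffling barely changes the score contribute little to the model's predictions on this data, which also makes them candidates for removal during feature selection.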
To summarize, we have covered many common pitfalls you may come across when deploying your ML models to production, and we suggested some preemptive steps to ensure the good performance of your model in the production setting. We have seen that it is important to not only look at the metrics that reflect our model’s performance but to delve into the story behind them as well.