Introduction
While data may or may not be the new oil, it is the lifeblood of Machine Learning (ML); it is fundamentally impossible to have an ML project without it.
Regardless of how sophisticated the model architecture is, if the data is of poor quality, the results will be equally bad or worse – as the old saying goes, “garbage in, garbage out.”
While it may be tempting to think that all you need to do is follow best practices for collecting training data, the unfortunate reality is that even a model trained on pristine data will degrade and need to be retrained.
Consequently, ML models must be closely monitored once they are put into production to ensure they still provide value. Specifically, ML models need to be monitored on two levels: resource level and performance level.
Resource Level
The resource level, being a traditional realm of DevOps, deals with a critical question: “Is the program running correctly in the production environment?”
What do we mean by “correctly?”
Put simply, we want to verify that CPU/GPU, RAM, network, and disk space usage is as expected. We need to analyze latency by asking if the requests are being processed at the expected rate.
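As a rough illustration, the sketch below collects a few host metrics and checks a latency percentile against a threshold. It assumes the psutil library and some hypothetical request timings; real deployments typically rely on a dedicated monitoring stack such as Prometheus or Grafana instead.

# A minimal sketch of resource-level checks, assuming psutil and illustrative request timings
import psutil

def resource_snapshot():
    """Collect basic host metrics to compare against expected baselines."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),    # CPU utilization over 1 second
        "ram_percent": psutil.virtual_memory().percent,   # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,   # disk space used on the root volume
    }

def latency_ok(request_durations_ms, threshold_ms=200):
    """Flag whether the 95th-percentile latency stays within the expected bound."""
    durations = sorted(request_durations_ms)
    p95 = durations[int(0.95 * (len(durations) - 1))]
    return p95 <= threshold_ms

print(resource_snapshot())
print(latency_ok([120, 95, 180, 210, 88]))  # hypothetical request timings in milliseconds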

While those working in DevOps would stop there, pushing updates only when there is a need to squash a bug or release a new feature, those in MLOps need to monitor the model on an additional level, which is the…
Performance Level
Once we confirm the model is running in production, we need to monitor its performance in production.
How do we do that?
We can use ML monitoring tools to answer a fundamental question: “Is the model performing as well in production as it did predeployment on your key metrics?”
If the answer is “Yes,” then great! Move on to the next item on your to-do list.
If it’s “No,” the next question is, “What could be causing this drop in performance?”
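Before digging into causes, it helps to make the check itself concrete. The sketch below compares production accuracy against the predeployment baseline; the choice of accuracy as the metric, the 5% tolerance, and the sample labels are illustrative assumptions, since it assumes you have logged ground-truth labels for recent production predictions.

# A minimal sketch of a performance-level check; metric, tolerance, and data are illustrative
from sklearn.metrics import accuracy_score

def performance_dropped(y_true_prod, y_pred_prod, baseline_accuracy, tolerance=0.05):
    """Return True if production accuracy falls more than `tolerance` below the predeployment baseline."""
    prod_accuracy = accuracy_score(y_true_prod, y_pred_prod)
    return prod_accuracy < baseline_accuracy - tolerance

# Hypothetical values: 0.92 accuracy at validation time, recent production labels and predictions
if performance_dropped([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0], baseline_accuracy=0.92):
    print("Alert: production performance has degraded - investigate the incoming data.")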
Given that we are talking about MLOps, the issue is almost certain to be found in the quality of the data the model uses to make predictions and generalize.
In accordance with the data-as-the-lifeblood-of-MLOps analogy, ML projects need infusions of new data to remain robust and effective, just as the body needs to create new red blood cells daily to stay healthy.
However, just as all blood donations are screened to avoid introducing a deadly virus to a patient, it is crucial to employ ML monitoring tools to avoid ruining the predictions made by the model by introducing data quality issues.

But what are some of these issues?
Data Quality Issues Suitable for Monitoring
Selection Bias
This occurs before the model goes into production.
As we all know, the first step to any ML project is to identify what problem you are trying to solve to create value.
Once we have identified the problem, we can source the data we’ll use to train, test, and validate our model. At this stage, any number of different types of sample selection bias (e.g., survivorship bias, samples of convenience) can come into play, resulting in the data being used to train and test the model not being indicative of the population at large.
If the model is trained on data that doesn’t accurately reflect the population, the predictions will be useless and/or potentially harmful.
By monitoring models in production, engineers will be able to quickly identify if the data the model is receiving for making predictions in production is too dissimilar from what it was trained on.
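One common way to quantify "too dissimilar" for a numeric feature is a two-sample statistical test between the training data and recent production data. The sketch below uses SciPy's Kolmogorov-Smirnov test on a synthetic "age" feature; the feature, the generated data, and the 0.05 threshold are assumptions for illustration.

# A minimal sketch of a train/production drift check for one numeric feature
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_ages = rng.normal(loc=35, scale=8, size=1000)    # what the model was trained on
production_ages = rng.normal(loc=48, scale=8, size=1000)  # what the model now receives

result = ks_2samp(training_ages, production_ages)
if result.pvalue < 0.05:
    print(f"Possible drift: production data differs from training data (p={result.pvalue:.4f})")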
However, assuming the model has been trained on quality data that accurately reflects the population, other data quality issues can occur such as:
Improper Values
ML models are trained on specific data types, with values falling within a specified range. Any of the following issues can cause ML models in production to fail unless the organization employs quality ML model monitoring tools to keep the pipelines running smoothly.
Missing/Null Values
XGBoost can handle missing values by default, but it is the outlier amongst ML models. For most other models, if a value is missing, the entire observation it belongs to must be dropped, or the value must be imputed using a measure of central tendency (i.e., mean, median, or mode), a multivariate imputer, or a KNN imputer.
You may ask, “Which method should I use?”
In an ideal situation, you wouldn’t have any missing/null values because you’d have a complete dataset. Imagine customers are not allowed to submit their data until they have completed all fields.
But suppose you have to impute missing values. In that case, it is best to use a multivariate imputer to avoid artificially narrowing a feature’s variance by using a measure of central tendency.
This is what that process looks like when using Python:
# Load the necessary libraries
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer

# Instantiate the imputers
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
knn_imputer = KNNImputer(n_neighbors=2)

# Sample data
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, np.nan, 7]]
print(np.matrix(X))
[[ 1.  2. nan]
 [ 3.  4.  3.]
 [nan  6.  5.]
 [ 8. nan  7.]]
# Impute with each column's mean
mean_imputer.fit_transform(X)
[[1. 2. 5.]
 [3. 4. 3.]
 [4. 6. 5.]
 [8. 4. 7.]]

# Impute with the average of the two nearest neighbors
knn_imputer.fit_transform(X)
[[1.  2.  4. ]
 [3.  4.  3. ]
 [5.5 6.  5. ]
 [8.  5.  7. ]]
Out of Range
A value is considered out of range if it falls outside the range of values seen before the model was put into production (i.e., during the training, testing, and validation phases).
For numeric values (e.g., height, weight, age, income), an out-of-range value could be an error (e.g., fat-finger typing), or it could be a genuine outlier (e.g., Yao Ming was over two meters tall by the age of 13).
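One simple way to catch numeric out-of-range values is to compare incoming data against the minimum and maximum observed during training. The sketch below is illustrative only: the column names, bounds, and sample data are assumptions, not part of any particular pipeline.

# A minimal sketch of an out-of-range check for numeric features
import pandas as pd

training_range = {"age": (18, 90), "income": (0, 500_000)}  # min/max seen during training

production_df = pd.DataFrame({"age": [25, 230, 41], "income": [52_000, 61_000, -10]})

for column, (low, high) in training_range.items():
    # Flag rows whose value falls outside the range seen during training
    out_of_range = production_df[(production_df[column] < low) | (production_df[column] > high)]
    if not out_of_range.empty:
        print(f"'{column}' has {len(out_of_range)} value(s) outside [{low}, {high}]:")
        print(out_of_range)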
In categorical data, which, as the name suggests, are variables that can be categorized (e.g., hair color, nationality, city), out-of-range data could be caused by spelling errors, typos, stray whitespace, or different generally accepted forms of the same term (e.g., “USA,” “U.S.A,” “ usa,” “USA ” with a trailing space, and “ U.S.A.”).
Luckily, you can use Deepchecks’ StringMismatch to identify examples like the one above:
First, we import the required libraries and generate the sample data.
# Import required libraries
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import StringMismatch

# Generate sample data
USA = ["USA", "U.S.A", " usa", "USA ", " U.S.A."]
washington_dc = ["Washington D.C."] * len(USA)
data = {'country': USA, 'capital': washington_dc}
df = pd.DataFrame(data=data)
Next, we pass our data to Deepchecks Dataset Wrapper and identify which features we want to check for string mismatches, run the check, and see the results.
dataset = Dataset(df, cat_features=['country', 'capital'])
result = StringMismatch().run(dataset)
result.show()
Shift in Cardinality
While we’re on the topic of categorical data, another issue that requires ML monitoring in production is a shift in cardinality: the set of categories a variable takes on, or the shape of their distribution, changes after the model has been put into production.
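One way to spot such a shift is to compare the categories (and their frequencies) seen during training with what arrives in production. The sketch below assumes a hypothetical "city" feature with made-up values.

# A minimal sketch of a cardinality-shift check for one categorical feature
import pandas as pd

train_city = pd.Series(["Paris", "Lyon", "Nice", "Paris", "Lyon"])
prod_city = pd.Series(["Paris", "Lyon", "Marseille", "Marseille", "Paris"])

new_categories = set(prod_city.unique()) - set(train_city.unique())
missing_categories = set(train_city.unique()) - set(prod_city.unique())

print("Categories never seen in training:", new_categories)             # e.g., {'Marseille'}
print("Training categories absent in production:", missing_categories)  # e.g., {'Nice'}
print(prod_city.value_counts(normalize=True))  # compare this distribution to the training one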
Type Mismatch
Imagine someone types “eighteen” instead of 18 for age. While people can quickly identify that these entries are the same, computers cannot unless they are explicitly programmed to do so, which is why ML monitoring tools are so valuable.
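A simple way to surface such entries is to coerce a column to its expected type and inspect whatever fails to convert. The sketch below assumes a hypothetical "age" column; the sample values are illustrative.

# A minimal sketch of a type-mismatch check for a column that should be numeric
import pandas as pd

ages = pd.Series(["18", "34", "eighteen", "52"], name="age")
coerced = pd.to_numeric(ages, errors="coerce")  # non-numeric strings become NaN
mismatched = ages[coerced.isna()]

print("Entries with the wrong type:", mismatched.tolist())  # ['eighteen']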
What Still Slips Through the Net
In a perfect world, ML monitoring software would identify all issues with data quality, freeing up everyone in the data science and ML team to work on the next project; just set it and forget it.
The reality is the exact opposite. Given that business priorities change over time, project leaders must ask, “Is the model solving the intended problem?” Just because the model is collecting and processing data as originally intended doesn’t mean the data it is collecting is of any use.
While ML monitoring software can identify many data quality issues, it cannot identify the greatest issue of all: collecting data for a use case that is no longer relevant. Given that the first step in any ML project is to frame the problem it is attempting to solve, it is paramount for leaders to convey that message to the team so they can assess whether the data being collected is still right for the stated objective.
Conclusion and Next Steps
Many things can go wrong in MLOps, which is why ML monitoring tools are vital for catching the issues we discussed as quickly as possible. Once they are identified, data engineers and data scientists can take the necessary steps (e.g., investigating the pipeline, retraining the model) to solve them.
Just as regular blood work is recommended for monitoring our physical health, monitoring models in production is necessary for keeping the lifeblood of our MLOps the picture of health.