This blog post was written by Tonye Harry as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via email@example.com. We typically pay a symbolic fee for content that’s accepted by our reviewers.
Imagine having done an excellent job of creating and implementing an image classification model that identifies fraud by analyzing banknote photos, with an accuracy of between 80% and 85%. A few months later, the model’s performance falls to between 60% and 70%. Despite your worry and investigation, you are aware that the fundamental issue may not be with your model. Data drifts occur frequently, influencing the overall performance of your model.
Drifts are not necessarily bad: they can tell an organization about long-term changes in user behavior or in real-world events. In situations like these, businesses may need to investigate and possibly adjust certain operations. For instance, during the peak of the COVID-19 outbreak, consumers reduced their purchases of clothing items. This unusual event affected enterprises operating in that sector because the new data differed from the historical data their models had treated as representative of reality. In turn, model performance became suboptimal, and these businesses had to adapt to reduce losses. Being proactive might be a better strategy for recognizing drifts, particularly in your computer vision applications.
In this article, we will discuss:
- The definition of data drifts and the various types of data drifts;
- How to find and address the drifts;
- How to prevent data drifts; and
- Automating the process of detecting data drifts and next steps.
Data Drift and Types of Data Drift
Data drift in Machine Learning happens when there is a difference between the historical data used to train and validate the model and the live production data. Some drifts in data can happen with or without model decay. The crucial factor that leads to data drift is time. Gaps between when data is gathered and when it is consumed are common in complex computer vision projects such as object localization and detection, image classification, and image segmentation. Other factors are mostly data integrity concerns during the data collection process, and seasonality. For example, a picture of a location taken in summer may look very different from one taken at the same location in winter with a lot of snow.
Figure 1. Shift in data distribution of input data (shifts in independent variables between 2019 and 2021)
There is a variety of data shifts, but this article looks at four important drifts you should look out for:
- Covariate Shift
- Label Shift
- Domain Shift/Adaptation
- Sample Selection Bias
Covariate Shift
Covariate shift occurs when the independent variables (features) shift between the training and production environments. This is typical when moving from a controlled offline or local environment to a live, dynamic one, where the features encountered may differ from those seen offline. It is also common with computer vision models: the images used to train the model might have different levels of lighting, contrast, or exposure from the images the model receives in production.
This shift can occur, for instance, if the input data used to train a model that uses x-ray images to detect a specific disease is not representative of real-world use cases. The disease might be prevalent in patients 40 years and older, while the model is trained on data collected from patients between 20 and 30 years old. This mismatch between the training distribution and the real-world distribution can affect the model’s performance significantly.
In most cases, test data will differ to a degree from your training data. When thinking of your overall project, take into account the degree to which this might impact your model. For computer vision models, you can increase the robustness of your input data by augmenting the data to fit the possible scenarios of the anticipated real-world data.
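The augmentation idea above can be sketched in a few lines. This is a minimal illustration, assuming grayscale images held as plain Python lists of rows; real pipelines typically use an array library and a dedicated augmentation package. The function names and default ranges are illustrative assumptions, not taken from any library.

```python
import random


def adjust_brightness_contrast(image, brightness=0.0, contrast=1.0):
    """Return a copy of a grayscale image with brightness and contrast changed.

    image: list of rows, each a list of pixel intensities in [0, 255].
    brightness: value added to every pixel after contrast scaling.
    contrast: factor applied around the mid-gray level 128.
    """
    adjusted = []
    for row in image:
        new_row = []
        for pixel in row:
            value = (pixel - 128) * contrast + 128 + brightness
            # Clamp so the result stays a valid 8-bit intensity.
            new_row.append(int(min(255, max(0, round(value)))))
        adjusted.append(new_row)
    return adjusted


def random_brightness_contrast(image, max_brightness=40, max_contrast_change=0.3, rng=None):
    """Apply a randomly drawn brightness offset and contrast factor (illustrative ranges)."""
    rng = rng or random.Random()
    brightness = rng.uniform(-max_brightness, max_brightness)
    contrast = 1.0 + rng.uniform(-max_contrast_change, max_contrast_change)
    return adjust_brightness_contrast(image, brightness, contrast)


image = [[0, 64, 128], [192, 255, 100]]
brighter = adjust_brightness_contrast(image, brightness=50)
augmented = random_brightness_contrast(image, rng=random.Random(0))
```

Training on several randomly augmented copies of each image exposes the model to lighting conditions it may meet in production but that are missing from the original dataset.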
The Deepchecks RobustnessReport check enables teams to test how well their model holds up under varying conditions (or corruptions) applied to each data example. It applies RandomBrightnessContrast, ShiftScaleRotate, HueSaturationValue, and RGBShift corruptions to images and measures the model’s performance on them.
Remember that a model can also appear too robust, which can point to overfitting on specific augmentations or on features not relevant to the problem. You can use the Deepchecks SuspiciousRobustness check to see whether the results on your input data are suspiciously robust.
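# Install deepchecks, including the computer vision subpackage
# (note: PyTorch should be installed separately):
pip install "deepchecks[vision]" --upgrade

from deepchecks.vision import VisionData
from deepchecks.vision.utils.detection_formatters import DetectionLabelFormatter, DetectionPredictionFormatter

# label_formatter_func and prediction_formatter_func are user-supplied functions that convert
# your loader's labels and your model's outputs into the format deepchecks expects.
label_formatter = DetectionLabelFormatter(label_formatter_func)
prediction_formatter = DetectionPredictionFormatter(prediction_formatter_func)

ds_train = VisionData(train_loader, label_formatter=label_formatter)
ds_test = VisionData(test_loader, label_formatter=label_formatter)

# Import path assumed from the deepchecks vision package layout; adjust it to your installed version.
from deepchecks.vision.checks.performance import RobustnessReport

check = RobustnessReport(prediction_formatter=prediction_formatter)
check.run(ds_test, model)

# Import path as given in the original post; adjust it to your installed version.
from suspicious_robustness import SuspiciousRobustness

check = SuspiciousRobustness(prediction_formatter)
check.run(ds_test, model)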
Label Shift
Label shift is a distribution drift that occurs when the distribution of labels changes over time. There is also a high likelihood that the target label will be distributed differently in the train and test sets after the data is split. When this occurs, it can affect the model’s ability to generalize accurately, which results in a decline in performance.
Consider a simple image classification project to classify cats and dogs. As usual, the dataset is split into training, validation, and test sets, but the label distribution of cats and dogs differs between the sets. The test set may contain a higher proportion of cats than the training set, which could affect the model’s measured performance. Label shift can also happen when the data for the computer vision use case is imbalanced: the model might work well in one scenario, but when transferred to another with new data or a different label distribution, it might not perform well enough.
To help data science teams find label drift between the train and test datasets, Deepchecks developed the TrainTestLabelDrift check. It computes drift scores (Earth Mover’s Distance and PSI) over label properties such as samples per class, bounding box area, and bounding boxes per image.
# Testing correct separation into train/test sets using the TrainTestLabelDrift check
from deepchecks.vision.checks.distribution import TrainTestLabelDrift

TrainTestLabelDrift().run(ds_train, ds_test)
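To make the PSI drift score more concrete, here is a minimal standalone sketch of the Population Stability Index computed over class labels. It is not the Deepchecks implementation; the function names and the smoothing constant `eps` are assumptions made for illustration.

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts.get(c, 0) / total for c in classes}


def population_stability_index(train_labels, test_labels, eps=1e-6):
    """PSI between two label sets: sum over classes of (q - p) * ln(q / p)."""
    classes = sorted(set(train_labels) | set(test_labels))
    p = label_distribution(train_labels, classes)
    q = label_distribution(test_labels, classes)
    psi = 0.0
    for c in classes:
        # eps avoids log(0) when a class is missing from one of the sets.
        expected = max(p[c], eps)
        actual = max(q[c], eps)
        psi += (actual - expected) * math.log(actual / expected)
    return psi


train_labels = ["cat"] * 50 + ["dog"] * 50
test_labels = ["cat"] * 80 + ["dog"] * 20
psi_same = population_stability_index(train_labels, train_labels)
psi_shifted = population_stability_index(train_labels, test_labels)
```

Identical distributions give a PSI of zero, and the score grows as the class proportions diverge. A commonly quoted rule of thumb treats values above roughly 0.2 as a notable shift, but the right threshold depends on your project.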
Domain Shift/Adaptation
Convolutional Neural Networks (CNNs) allow for very accurate predictions, but the caveat is that there has to be ample input data for the computer vision model to generalize well to the target domain. If your use case doesn’t have a rich source of data, it can affect the model’s performance. Although there is no set limit on the number of data points, a good rule of thumb is to have 1,000 sample images for each class. This number comes from the original ImageNet challenge.
Consider an image segmentation use case for an autonomous vehicle. Models built around images captured in one country might work well in that area, but when used in another, the computer vision model might struggle to identify objects like traffic lights, road signs, and cars that look different across countries, because the problem domain has changed.
Transfer Learning is often used to compensate for this deficit. It involves transferring the learned weights of a previously trained model to a model for a related but different problem: the use case is similar, but the input data for the specific problem is different. Because the pre-trained model was trained with more data and better resources, such as greater GPU capacity, and possibly on data that includes examples from the problem domain, it helps the new model generalize better with the data available. You can adjust the architecture and parameters to fit your case and get optimal results.
Sample Selection Bias
Predictive models generalize from data collected by humans and may take on biases if that data is not a representative sample of real-world data. In a polarized political climate, selection bias issues can have serious social consequences and lead to a loss of business opportunities. The model maps out relationships within historical data and uses that knowledge to make predictions on future inputs. If the data is not representative of reality, the robustness of the model will be affected.
Wilson, Hoffman, and Morgenstern’s research, for instance, looked into predictive inequity in object detection and discovered that people of color were less likely to be detected by a self-driving car model. Neither time of day nor occlusion explained why the model had that predictive bias.
Look at the model’s variance as well as its bias. Variance describes how much the model’s output changes when new data is introduced. When variance is high, the model fits the training set too closely and performs poorly on the test set. High variance typically occurs when the model is complex and has numerous features.
Analyzing the training and test errors is another way to look for bias. Ideally, both training and test errors would be low; with high bias, the model is overly simplified and both errors are high.
You can also consider performing Slice-based Learning to understand what parts of your data the models work well on.
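A simpler first step in the same spirit is slice-based evaluation: computing accuracy separately for each slice of the data. The sketch below is that diagnostic only, not Slice-based Learning itself. It assumes each record carries a hypothetical slice name (here, lighting condition) alongside the true label and the prediction.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy per slice from (slice_name, label, prediction) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, label, prediction in records:
        total[slice_name] += 1
        if label == prediction:
            correct[slice_name] += 1
    return {name: correct[name] / total[name] for name in total}


# Illustrative records; the slice names and labels are made up for this example.
records = [
    ("daylight", "pedestrian", "pedestrian"),
    ("daylight", "car", "car"),
    ("daylight", "car", "car"),
    ("night", "pedestrian", "background"),
    ("night", "car", "car"),
]
per_slice = accuracy_by_slice(records)
```

A large gap between slices, such as daylight versus night here, suggests that one part of the real-world distribution is underrepresented in the training data.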
Another Drift to Consider
Concept Drift is a type of model shift that happens when properties of the dependent variable change. Data drift can be a strong indicator of concept drift. When an incident of data drift occurs, it has to be investigated since it can potentially cause model decay. To detect concept drift, you can use ADWIN (ADaptive WINdowing), Kolmogorov-Smirnov test, chi-squared test, or adversarial validation – depending on the form of data (streaming or batched) you are working with.
Concept drift: P(y|X) changes – the probability of the class ‘y’ given the input features ‘X’ changes
Prediction drift: P(ŷ) changes – the distribution of the prediction output ‘ŷ’ changes
Label drift: P(y) changes – the class (y) probability changes
Data drift: P(X) changes – the distribution of the input features (X) changes
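For batched data, the Kolmogorov-Smirnov test mentioned above can be sketched without any external library. The example applies it to a single input feature (a hypothetical per-image mean brightness), so strictly speaking it flags a change in P(X), which is then a cue to investigate possible concept drift. The decision rule uses the standard large-sample critical value; the function names are illustrative.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


def ks_drift_detected(reference, production, alpha=0.05):
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(production)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, production) > critical


# Hypothetical mean-brightness values for training and production images.
reference = [100 + (i % 20) for i in range(200)]
production = [130 + (i % 20) for i in range(200)]
drifted = ks_drift_detected(reference, production)
stable = ks_drift_detected(reference, reference)
```

In practice you would more likely call an existing statistics library than hand-roll the test; the point here is only to show what the statistic measures.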
Avoiding Data Drift
Being proactive by anticipating data drifts is the key to avoiding and/or effectively resolving them in your Machine Learning projects. During the planning phases, data scientists have to understand the data collected, identify the metrics to monitor, and account for the timing between data collection and deployment.
There are two ways to be proactive:
- Ensure data quality
- Monitor the ML system for data drift
To ensure data quality:
- Create expectations for your data to ensure things like missing values, distribution type, empty features, descriptive statistics, and other variables important to your project are checked. Great Expectations is a good library to use for this.
- You can use Deepchecks to validate your computer vision dataset by checking for label shift (unbalanced data) and the robustness of the data in relation to the model.
- Ensure that the historical data presented to the model is as randomized as possible and highly representative of reality.
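The idea of data expectations can be illustrated with a small hand-written validator. This is a sketch in plain Python, not the Great Expectations or Deepchecks API. It assumes each sample is described by a dictionary with hypothetical `label` and `size` keys.

```python
def validate_samples(samples, allowed_labels, expected_size):
    """Return a list of human-readable problems found in the samples."""
    problems = []
    for index, sample in enumerate(samples):
        label = sample.get("label")
        if label is None:
            problems.append(f"sample {index}: missing label")
        elif label not in allowed_labels:
            problems.append(f"sample {index}: unexpected label {label!r}")
        if sample.get("size") != expected_size:
            problems.append(f"sample {index}: size {sample.get('size')} != {expected_size}")
    return problems


# Illustrative metadata for three images.
samples = [
    {"label": "cat", "size": (224, 224)},
    {"label": None, "size": (224, 224)},
    {"label": "bird", "size": (128, 128)},
]
problems = validate_samples(samples, allowed_labels={"cat", "dog"}, expected_size=(224, 224))
```

Running checks like this before training and again on incoming production batches catches integrity problems before they show up as unexplained drift.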
Data Drift Monitoring
Monitoring all the drifts that can happen to your model is time consuming when done manually. If that’s what you are doing at the moment, consider using open source or paid tools for your data drift monitoring needs, like Evidently AI or WhyLabs.
Drift detection can be done by:
- Statistical tests. Note that parametric tests can be more sensitive to drift than non-parametric tests (such as chi-squared tests, in one- or two-sample form, and Kolmogorov-Smirnov tests). This works most efficiently if you monitor only the most important features rather than a large number of features, which makes fine-tuning difficult. To make this effective, define KPIs for your features. For example, you select a feature of your computer vision dataset and require that a certain metric for it holds at all times; if the metric is not met, the tool you use can alert you to the issue.
Parametric and Nonparametric Tests
- Sampling or bucketing when you have a lot of features. This might involve choosing a representative or an aggregate metric and comparing it using statistical tests.
For streaming data, your representative or aggregate metric (mean, median, mode, etc.) can be computed sequentially or continuously. Also consider using a window function when running statistical tests on this type of data.
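A windowed aggregate for streaming data can be sketched as follows. This is a fixed-size sliding window with a simple threshold, not ADWIN or any other adaptive method. The class name, window size, and threshold are illustrative assumptions.

```python
from collections import deque


class WindowedMeanMonitor:
    """Track the mean of the most recent observations and compare it to a reference."""

    def __init__(self, reference_mean, window_size=50, threshold=10.0):
        self.reference_mean = reference_mean
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value):
        """Add one observation; return True if the full window has drifted."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # Not enough data yet to judge.
        window_mean = sum(self.window) / len(self.window)
        return abs(window_mean - self.reference_mean) > self.threshold


# Hypothetical stream of per-image mean brightness values that shifts upward.
monitor = WindowedMeanMonitor(reference_mean=120.0, window_size=5, threshold=10.0)
stream = [119, 121, 120, 122, 118, 150, 155, 160, 158, 152]
alerts = [monitor.update(value) for value in stream]
```

The monitor stays quiet while the window fills and while the window mean is close to the reference, then starts alerting once enough shifted values have entered the window. A larger window reacts more slowly but is less sensitive to single outliers.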
Automating Data Drift Detection
Poor detection of drifts can cost your organization not only model performance but revenue as well. It can also cause unnecessary PR issues if it raises ethical concerns among the public. To avoid these repercussions and enable teams to collaborate well, different monitoring tools are available for the early detection of data drifts.
Figure 5. Data drift detection dashboard showing data distributions
These tools can automate data drift detection to reduce the amount of tedious work that would otherwise be done by a human. They work by setting thresholds and, through a feedback loop, alerting the team so it can act when data drift occurs. Well-known commercial and non-commercial tools and libraries for this include WhyLabs, Fiddler AI, Evidently AI, Deepchecks, and skmultiflow.drift_detection, which implements ADWIN and other detectors.
After detecting a drift, do not assume that the model is faulty. Investigate the problem first:
- Run checks to validate your data.
- Look out for quality issues with your data.
- Investigate the drift for further information. Sometimes drifts are a revelation of a real-world change and might not be totally bad.
- Compare data distribution to ascertain if model retraining is needed.
- Tweak your model if needed.
- Check the thresholds and recalibrate.
- If necessary, take a step back and consider whether the business model needs to adjust.
Always get the best out of your computer vision projects by considering the kinds of drifts you might encounter, avoiding them, or if they can’t be avoided, taking steps to overcome them. Remember that data drift in Machine Learning is common and you should be well equipped to overcome it.