Data Drift vs. Concept Drift

Introduction

Putting Machine Learning (ML) models into production is a great achievement, but your work does not stop there. The performance of your models may degrade over time due to a concept called “model drift.” Your model in production is constantly receiving new data to make predictions upon. However, this data might have a different probability distribution than the one you trained the model on. Using the original model with the new data distribution will cause a drop in model performance. To catch this degradation before it causes harm, you need to monitor your model’s performance over time.

ML model drift is a situation where a model’s performance degrades over time, causing the model to start giving poor predictions. ML model drift can be categorized into two broad categories: concept drift and data drift.

This article explains the core ideas behind data drift vs. concept drift. It covers what they are, the reasons behind them, their differences, and how to detect and handle them in an ML project.

A Note About Terminology

The terminology around concept and data drift is confusing, for several reasons.

Machine Learning is a young and fast-growing area of the software engineering discipline, with novel ideas emerging every day across research and business domains.

The definitions differ because of the different research, textbook, and production environments people work with. For example, “concept drift” is used as an umbrella term in online learning. However, batch learning papers refer to the same thing as “dataset drift” (e.g., here and here).

This blog post uses the terms “concept drift” and “data drift”, following widely accepted Machine Learning engineering conventions, and notes alternative terms to clarify how they relate to each other.

Concept Drift in Machine Learning

To understand what concept drift is, we first need to define “concept” in this context. A concept is the joint probability distribution of a Machine Learning model’s inputs (X) and outputs (Y). We can decompose this distribution as follows:

P(X, Y) = P(Y) P(X|Y) = P(X) P(Y|X)
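As a quick sanity check with made-up numbers, this identity can be verified on a tiny 2×2 joint distribution (the probabilities below are purely illustrative):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over binary X (rows) and Y (columns).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)             # marginal P(X)
p_y = joint.sum(axis=0)             # marginal P(Y)
p_y_given_x = joint / p_x[:, None]  # conditional P(Y|X)
p_x_given_y = joint / p_y[None, :]  # conditional P(X|Y)

# Both factorizations recover the joint: P(X)P(Y|X) = P(Y)P(X|Y) = P(X, Y)
assert np.allclose(p_x[:, None] * p_y_given_x, joint)
assert np.allclose(p_y[None, :] * p_x_given_y, joint)
```

A drift in any of these components (the marginals or the conditionals) changes the joint distribution, which is why each one is a potential source of drift.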

Concept drift can originate from any of the concept’s components. The most important source is the posterior class probability P(Y|X), as it describes the relationship between inputs and outputs that our model tries to learn. For this reason, people use the term “concept drift” or “real concept drift” for this specific type.

Concept shift/drift happens when the posterior probability of Y given X, that is, the probability of output Y given input X, changes over time:

P_t1(Y|X) ≠ P_t2(Y|X)

Where:
t1 = initial time
t2 = final time

(Real) concept drift is the situation when the functional relationship between the model inputs and outputs changes. The context has changed, but the model doesn’t know about the change. Its learned patterns do not hold anymore.

Other terms for concept shift are class drift, real concept drift, or posterior probability shift.

The cause of the relationship change is usually some external event or process in the real world. For example, suppose we predict life expectancy using geographic region as an input. As a region’s development level increases (or decreases), the region’s effect on life expectancy changes, so the model’s predictions no longer hold true.

This mechanism is also behind the original understanding of “concept drift,” the change in the “meaning” of predicted labels. A common example is the shifting view of what emailing behavior we consider “normal” or “spam.” Sending emails frequently and in mass was a clear sign of spamming behavior a few years ago. Today, this is much less so. Models using these attributes to identify spam experienced concept drift and had to be retrained.
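As a toy illustration (not from the original article), the NumPy sketch below simulates real concept drift: the input distribution P(X) is identical at both times, but the labeling rule, i.e., P(Y|X), flips, so a model frozen at t1 fails completely at t2. All names and the labeling rules are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same input distribution P(X) at both times: feature x ~ Uniform(0, 1).
x_t1 = rng.uniform(0, 1, 10_000)
x_t2 = rng.uniform(0, 1, 10_000)

# Hypothetical labeling rules: at t1, y = 1 when x > 0.5;
# at t2 the relationship P(Y|X) flips, so y = 1 when x <= 0.5.
y_t1 = (x_t1 > 0.5).astype(int)
y_t2 = (x_t2 <= 0.5).astype(int)

# A "model" trained at t1 simply learned the rule y_hat = (x > 0.5).
def model(x):
    return (x > 0.5).astype(int)

acc_t1 = (model(x_t1) == y_t1).mean()  # 1.0: the rule matches P(Y|X) at t1
acc_t2 = (model(x_t2) == y_t2).mean()  # 0.0: P(Y|X) changed, P(X) did not
```

Real drift is rarely this abrupt or this total, but the mechanism is the same: the model’s learned mapping no longer matches the current relationship between inputs and outputs.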

Other examples of concept change:

  • The effect of the tax law change on predicting tax compliance.
  • Changing consumer preferences when predicting products bought.
  • Predicting company profits after a financial crisis.

It is also useful to look at the other statistical components as they can affect model performance or help predict the presence of “real concept drift.” For this reason, we can distinguish the following additional sources of drift:

  • “Data drift”: Covariates P(X)
  • “Label drift”: Prior probabilities P(Y)
  • Conditional covariates: P(X|Y)

However, it is important to note that some of these probabilities affect each other because of their relationship (e.g., for P(Y) to change, P(X) or P(Y|X) also has to change).

There are other ways to categorize concept drift, like frequency, transition speed, magnitude, recurrence, etc. This graph gives a good overview. Some of the terms in the graph are discussed later in this article.


Data Drift in Machine Learning

Data drift is the situation where the model’s input distribution changes.

P_t1(X) ≠ P_t2(X)

People also call data drift covariate shift, virtual drift, or virtual concept drift, depending on their definition of “concept.” Other terms are feature drift and population drift.

So what does data drift mean? A helpful way to think about this is to consider feature segments. A segment here refers to a specific categorical value or a continuous value range in our input features. Example segments are an age group, a geographical origin, or customers from particular marketing channels.

Let’s say our model works well on the entire dataset but does not produce good results on a specific segment (e.g., because of little exposure). Low segment-level performance is not a problem if the segment’s proportion is small, and we aim only for aggregate results.

However, our overall performance drops when the model receives new data with a high proportion of the poorly predicted segments. The input distribution shift makes the model less capable of predicting labels.

In the diagram below, we see that at the start time, the data stream consists of user traffic from paid search and organic traffic. Over time, the user traffic changes (maybe due to a social media marketing campaign) and we see more paid social media traffic and less paid search. If the model was being used to predict the number of users expected on an app, its performance would drop because it was not trained on a data set with a feature showing a lot of paid social media traffic.
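The segment mechanism can be sketched with assumed numbers: suppose a model is 95% accurate on a well-learned segment A and only 60% accurate on an under-learned segment B. When B’s share of the traffic grows, aggregate accuracy falls even though P(Y|X) never changed. The per-segment accuracies below are made up purely for illustration.

```python
# Hypothetical per-segment accuracies: the model was mostly trained
# on segment A and has seen little of segment B.
acc_segment = {"A": 0.95, "B": 0.60}

def overall_accuracy(p_b):
    """Expected overall accuracy when a fraction p_b of traffic is segment B."""
    return (1 - p_b) * acc_segment["A"] + p_b * acc_segment["B"]

acc_before = overall_accuracy(0.05)  # training-time mix: mostly segment A
acc_after = overall_accuracy(0.60)   # after drift: segment B dominates

# 0.95*0.95 + 0.05*0.60 = 0.9325 before; 0.40*0.95 + 0.60*0.60 = 0.74 after
```

Nothing about the model or the labels changed; only the mix of inputs did, and that alone is enough to sink the aggregate metric.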

Data drift does not mean a change in the relationship between the input variables and the output. The model’s performance weakens because it receives data on which it hasn’t been trained enough. It can also occur if the data received by the model at inference time contains features that were not present during training.

Performance degrades in proportion to the importance of that particular feature. For example, if we try to predict cancer and train our model on non-smokers, introducing a significant smoking subpopulation can alter the results.

The main cause of data drift is the increased relevance of “under-learned” segments. This can occur via different mechanisms:

  • Sample selection bias: A systematic flaw in data collection or labelling makes the training sample unrepresentative of the population.
  • Non-stationary environment: The environment where the model is trained differs from the one where it is used (e.g., temporal change or deploying the model in a new geography).
  • Upstream data transformation: Changes in upstream data processing alter feature value distributions. (Many identify this as a distinct drift type, depending on their definition.)

The Difference Between Data Drift And Real Concept Drift

  1. In (real) concept drift, the decision boundary P(Y|X) changes, while in the case of data drift (or virtual drift), the boundary remains the same even though P(X) has changed.
  2. Another difference is that in data drift, the cause is somewhat internal to the process of collecting and processing data and training our model on it. In the case of concept drift, the cause is usually an external event.
  3. With data drift, only the features are affected, while with concept drift, either the labels, the features, or both are affected.

How to Detect Concept Drift and Data Drift

One way to detect model drift is through user feedback. However, for high-value models, you do not want users to experience the performance degradation before it is detected. When an ML model is deployed to production, it needs to be monitored. The following methods can be used to detect drift:

  • Performance monitoring: When your data contains labels, you can use this method. To accomplish this, simply monitor model performance metrics such as accuracy, precision, and a variety of statistical measures. Deepchecks provides out-of-the-box methods for quickly obtaining reports on your model’s performance and the presence of concept or data drift. You can also develop your own custom metrics based on the requirements of your model. This paper on concept drift adaptation discusses some approaches that you can try.
  • Data monitoring: If your data has no labels, you can detect data drift by monitoring and comparing the statistical properties of the training and production data, such as their distributions, robustness, and completeness. The data science team typically sets thresholds so that the monitoring tool alerts them when anything changes beyond them. Common metrics include the Population Stability Index (PSI) and Jensen-Shannon (JS) divergence, among others.
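As one concrete (hypothetical, NumPy-only) sketch of data monitoring, the function below computes PSI for a single numeric feature by binning the reference sample into deciles and comparing bin frequencies between training and production data. A common rule of thumb treats PSI below 0.1 as stable and above 0.2 as significant drift.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference (training) sample
    and a production sample of one numeric feature."""
    # Bin edges from the reference distribution's quantiles (deciles).
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac, a_frac = e_frac + eps, a_frac + eps  # avoid log(0)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)       # no drift: PSI near 0
shifted = rng.normal(1.0, 1, 10_000)  # mean shifted: PSI well above 0.2

psi_same = psi(train, same)
psi_shifted = psi(train, shifted)
```

Quantile-based binning is used here so every reference bin holds roughly 10% of the training data, which keeps the per-bin comparison stable; equal-width bins are a common alternative.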

How to Handle Concept Drift and Data Drift

To handle model drift in Machine Learning, you can use one or more of these strategies, depending on the cause of the drift:

  • Retrain or adapt your Machine Learning model: If the drift is a result of changes in the distribution of your data, you can either retrain your model or adapt it by adjusting model parameters (such as the training weights) to account for changes in the information carried by the data features.
  • Update your training data with new data that carry current information about the relationship between the input and output data and retrain your model.
  • Schedule regular model updates if the drift is seasonal.
  • Update your data pipeline: There may be data discrepancies due to some problems with the Data Engineering, for example, a change in the data schema from the API serving your ML model. This demands that you make the required changes to ensure that your model gets the data in the structure and format it was trained on.
  • Stay up to date on changes to the data schema and always update your model accordingly. Let’s say you work with data on the volumes of substances produced at your organization. If the unit of measurement changes from litres to cubic metres, you need to know about it in order to make appropriate changes (like retraining your model), because the expected values of your data change.
  • Online training: The advantage of online training is that your model stays updated irrespective of any changes that occur. If you observe that your model is regularly drifting, you can decide to adopt online training where you train your model as new data enters the system. This is advisable if it is cost-effective to train the model continuously; it also depends on the data latency.
  • When the drift is a result of sample selection bias, if possible, collect data points that are representative of the necessary patterns in the data.
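To sketch the online-training idea without assuming any particular library, here is a minimal NumPy-only logistic model updated batch-by-batch; the `partial_fit` helper and the drifting data stream are hypothetical. When the concept flips mid-stream, continued updates let the model re-learn the new relationship instead of staying frozen at training time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal online logistic regression (one weight + bias), a sketch of
# the "online training" strategy: update the model on each new batch.
w, b, lr = 0.0, 0.0, 0.1

def predict_proba(x):
    return 1 / (1 + np.exp(-(w * x + b)))

def partial_fit(x, y):
    """One SGD step on a mini-batch (hypothetical helper)."""
    global w, b
    p = predict_proba(x)
    w -= lr * np.mean((p - y) * x)  # gradient of log-loss w.r.t. w
    b -= lr * np.mean(p - y)        # gradient of log-loss w.r.t. b

# Stream of batches whose concept flips halfway through: y = (x > 0)
# becomes y = (x < 0). The model keeps adapting batch by batch.
for step in range(2000):
    x = rng.uniform(-1, 1, 64)
    y = (x > 0).astype(int) if step < 1000 else (x < 0).astype(int)
    partial_fit(x, y)

# After the flip, continued updates push the weight negative,
# matching the new concept.
```

This only pays off when continuous training is cost-effective and labels arrive with low latency, as the article notes; otherwise periodic retraining is usually the simpler choice.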

Monitor Data Drift and Concept Drift in Your Machine Learning Workflow

To maintain the performance of your models, you need to catch data and concept drift early. To do that, you need to monitor your model so you can identify drift before it causes damage.

Since there are different drift types, you need to implement monitoring functions for each. You can do this manually, or you can use a Machine Learning monitoring framework like Deepchecks for that.

Deepchecks provides a framework for monitoring ML models in production. It helps you detect model drift so you can handle it before it degrades the performance of your models. Deepchecks gives a full report with graphs (and their explanations) that visualize the presence of drift in your feature and label data.

Would you like to learn more? Check out our case studies.

