Putting machine learning models into production is a great achievement, but your work does not stop there. The performance of your models will degrade over time because of concept or data drift.
Your model in production constantly receives new data to make predictions on. However, this data might follow a different probability distribution than the data on which you trained the model. Using the original model with the new data distribution will cause a drop in model performance.
To avoid performance degradation, you need to monitor these changes.
However, when you learn more about these concepts, you face a proliferation of terms and multitudes of ‘drift’ and ‘shift’ types that sometimes contradict each other.
This article explains the core ideas behind data drift and concept drift. You will learn how they differ, what causes them, and how you can tell them apart.
By the end of the article, you will have a solid grasp of what concept and data drift are and will be able to evaluate how big a problem they are for your machine learning project.
A Note About Terminology
The terminology around concept and data drift is confusing for several reasons.
Machine Learning is a new but dynamically growing area in the software engineering context, with novel ideas coming up every day in different research and business domains.
The definitions also differ because people work with these models in different research, textbook, and production environments. For example, “concept drift” is used as an umbrella term in online learning. However, batch learning papers refer to the same thing as “dataset drift” (e.g., here and here).
This blog post uses “concept drift” and “data drift” following machine learning engineering conventions and denotes alternatives to clarify their relationship with each other.
To know what concept drift is, we need a definition of “concept”. Concept stands for the joint probability distribution of a machine learning model’s inputs (X) and outputs (Y). We can express their relationship in the following form:
P(X, Y) = P(Y) P(X|Y) = P(X) P(Y|X)
Concept drift happens when the joint distribution of inputs and outputs changes between two points in time:
Pt1 (X, Y) ≠ Pt2 (X, Y)
Concept drift can originate from any of the concept components. The most important source is the posterior class probability P(Y|X), as it describes the relationship between inputs and outputs that our model tries to learn. For this reason, people use the term “concept drift” or “real concept drift” for this specific type.
It is also useful to look at the other components, as changes in them can affect model performance or signal the presence of “real concept drift”. For this reason, we can distinguish the following additional sources of drift:
- “Data drift”: Covariates P(X)
- “Label drift”: Prior probabilities P(Y)
- Conditional covariates: P(X|Y)
However, it is important to note that some of these probabilities affect each other because of their relationship (e.g., for P(Y) to change, P(X) or P(Y|X) also has to change).
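The dependency between these probabilities can be checked with a small worked example. The sketch below uses made-up numbers (the segment names, label names, and probabilities are all assumptions for illustration) and holds P(Y|X) fixed while P(X) changes. Because P(Y) is the sum of P(X) P(Y|X) over all input values, P(Y) moves as well.

```python
def marginal_y(p_x, p_y_given_x):
    """Return P(Y) from P(X) and P(Y|X) via the law of total probability."""
    labels = {y for cond in p_y_given_x.values() for y in cond}
    return {
        y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x)
        for y in sorted(labels)
    }

# Hypothetical P(Y|X), identical at both time points (no real concept drift).
p_y_given_x = {
    "smoker": {"healthy": 0.7, "ill": 0.3},
    "non_smoker": {"healthy": 0.9, "ill": 0.1},
}

# Hypothetical P(X): the share of smokers grows between t1 and t2 (data drift).
p_x_t1 = {"smoker": 0.1, "non_smoker": 0.9}
p_x_t2 = {"smoker": 0.5, "non_smoker": 0.5}

p_y_t1 = marginal_y(p_x_t1, p_y_given_x)
p_y_t2 = marginal_y(p_x_t2, p_y_given_x)

# P(Y = ill) rises from about 0.12 to about 0.20 although P(Y|X) never changed.
print(round(p_y_t1["ill"], 2), round(p_y_t2["ill"], 2))
```

In this toy case, the label drift is entirely a consequence of the data drift, which is why the two are hard to separate by looking at P(Y) alone.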
There are other ways to categorize concept drift like frequency, transition speed, magnitude, and recurrence. We don’t have the space to discuss them, but this graph gives a good overview:
Data Drift in Machine Learning
Data drift is the situation where the model’s input distribution changes.
Pt1 (X) ≠ Pt2 (X)
People also call data drift covariate shift, virtual drift, or virtual concept drift depending on their definition of ‘concept’. Other terms are feature drift or population drift.
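In practice, we only have samples of a feature from the two periods, so the inequality above has to be checked statistically. One common option, shown here as a minimal sketch rather than a recommended method, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical distribution functions. The data is simulated, and the feature and its parameters are assumptions. Libraries such as SciPy offer an equivalent test (`scipy.stats.ks_2samp`) that also returns a p-value.

```python
import random
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both empirical CDFs are step functions, so checking every observed
    # value is enough to find the maximum difference.
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical numeric feature (e.g., customer age) at training time.
reference = [rng.gauss(40, 10) for _ in range(2000)]
# Production sample drawn from the same distribution: no data drift.
same_dist = [rng.gauss(40, 10) for _ in range(2000)]
# Production sample whose mean moved by one standard deviation: data drift.
shifted = [rng.gauss(50, 10) for _ in range(2000)]

stat_same = ks_statistic(reference, same_dist)
stat_shifted = ks_statistic(reference, shifted)

print(f"no drift: {stat_same:.3f}, drift: {stat_shifted:.3f}")
```

The statistic stays close to zero when both samples come from the same distribution and grows as the distributions move apart. What counts as "too large" depends on the sample size and on how sensitive the model is to that feature.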
So what does data drift mean? A helpful way to think about this is to consider feature segments. A segment here refers to a specific categorical value or a continuous value range in our input features. Example segments are an age group, a geographical origin, or customers from particular marketing channels.
Let’s say our model works well on the entire dataset, but it does not produce good results on a specific segment (e.g., because of little exposure). Low segment-level performance is not a problem if the segment’s proportion is small, and we aim only for aggregate results.
However, our overall performance drops when our model receives new data with a high proportion of the poorly predicted segment. The input distribution shift makes the model less capable of predicting labels.
Data drift does not mean a change in the relationship between the input variables and the output. It weakens performance because the model receives data on which it hasn’t trained enough.
Performance degrades in proportion to the importance of that particular feature. For example, if we try to predict cancer and train our model on non-smokers, introducing a significant smoking subpopulation can alter the results.
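The arithmetic behind this effect is easy to sketch. In the example below, the segment names, segment shares, and per-segment accuracy values are made-up assumptions. The model itself is frozen, so its accuracy within each segment stays the same, and only the mix of segments changes.

```python
def overall_accuracy(segment_share, segment_accuracy):
    """Aggregate accuracy as the share-weighted mean of per-segment accuracy."""
    return sum(segment_share[s] * segment_accuracy[s] for s in segment_share)

# Hypothetical accuracy of a fixed model within each segment.
# The smoker segment is "underlearned" because it was rare in training.
segment_accuracy = {"non_smoker": 0.95, "smoker": 0.60}

# Hypothetical segment proportions at training time and in production.
share_training = {"non_smoker": 0.98, "smoker": 0.02}
share_production = {"non_smoker": 0.60, "smoker": 0.40}

acc_training = overall_accuracy(share_training, segment_accuracy)
acc_production = overall_accuracy(share_production, segment_accuracy)

# Roughly 0.943 at training time versus 0.81 in production.
print(round(acc_training, 3), round(acc_production, 3))
```

Nothing about the relationship between inputs and outputs changed here. The drop comes purely from the poorly predicted segment making up a larger share of the incoming data.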
The main cause of data drift is the increased relevance of “underlearned” segments. This can occur via different mechanisms:
- Sample selection bias: A systematic flaw in data collection, labeling, or sampling makes the training sample unrepresentative of the data the model sees later.
- Non-stationary environment: The training and testing environments differ (e.g., temporal change or using the model in a new geography).
- Upstream data transformation: Upstream data processing changes affect feature value distributions. (Many identify this as a different drift type, depending on their definition.)
Concept Drift in Machine Learning
(Real) concept drift is the situation when the functional relationship between the model inputs and outputs changes. The context has changed, but the model doesn’t know about the change. Its learned patterns do not hold anymore.
Concept drift means that the posterior probabilities change between two points in time.
Pt1 (Y|X) ≠ Pt2 (Y|X)
Other terms for concept shift are class drift, real concept drift, or posterior probability shift.
The cause of the relationship change is some kind of external event or process. For example, suppose we try to predict life expectancy using geographic regions as input. As a region’s development level increases (or decreases), the region loses its predictive power, and our model degrades.
This mechanism is also behind the original understanding of ‘concept drift’, the change of “meaning” of predicted labels. A common example is the shifting view of what emailing behavior we consider “normal” or “spam”. Sending emails frequently and in mass was a clear sign of spamming behavior a few years ago. Today, this is much less so. Models using these attributes to identify spam experienced concept drift and had to be retrained.
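The spam example can be turned into a toy simulation. Everything in the sketch below is a hypothetical assumption: the single feature (emails sent per day), the thresholds, and the frozen rule standing in for a trained model. The inputs are identical at both time points, so P(X) does not change. Only the definition of "spam", and therefore P(Y|X), moves.

```python
def spam_rule(emails_per_day):
    """Frozen model learned at t1: flag senders above 100 emails/day as spam."""
    return emails_per_day > 100

def accuracy(model, inputs, labels):
    """Share of inputs for which the model's prediction matches the label."""
    hits = sum(model(x) == y for x, y in zip(inputs, labels))
    return hits / len(inputs)

# The same 50 senders at both time points: 0, 10, ..., 490 emails per day.
inputs = list(range(0, 500, 10))

# Hypothetical ground truth. At t1, anything above 100/day counts as spam.
# At t2, bulk sending is common, and only volumes above 300/day count as spam.
labels_t1 = [x > 100 for x in inputs]
labels_t2 = [x > 300 for x in inputs]

acc_t1 = accuracy(spam_rule, inputs, labels_t1)
acc_t2 = accuracy(spam_rule, inputs, labels_t2)

# The rule is perfect at t1 and wrong for every sender between 110 and 300 at t2.
print(acc_t1, acc_t2)
```

A drift detector that looks only at the input distribution would see nothing here, which is why real concept drift usually has to be caught through labels or performance metrics.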
Other examples of concept change:
- The effect of a tax law change on predicting tax compliance.
- Changing consumer preferences when predicting products bought.
- Predicting company profits after a financial crisis.
The Difference Between “Data Drift” And Real Concept Drift
In the case of (real) concept drift, the decision boundary P(Y|X) changes. In the case of data drift (or virtual drift), the boundary remains the same even though P(X) has changed.
Another difference is that in data drift, the cause is somewhat internal to the process of collecting and processing data and training our model on it. In the case of concept drift, the reason is usually external to this process.
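When labels are available for both periods, the two cases can be told apart by estimating P(X) and P(Y|X) separately. The sketch below does this for a single categorical segment and a binary label. The function names, the toy counts, and the 0.05 tolerance are assumptions for illustration; a real check would also need to account for sampling noise.

```python
from collections import Counter

def segment_shares(rows):
    """Estimate P(X): the share of rows that fall into each segment."""
    counts = Counter(seg for seg, _ in rows)
    return {seg: n / len(rows) for seg, n in counts.items()}

def positive_rates(rows):
    """Estimate P(Y=1|X): the positive-label rate within each segment."""
    totals, positives = Counter(), Counter()
    for seg, label in rows:
        totals[seg] += 1
        positives[seg] += label
    return {seg: positives[seg] / totals[seg] for seg in totals}

def diagnose(reference, current, tol=0.05):
    """Classify the change between two labeled samples (tol is an arbitrary cutoff)."""
    p_ref, p_cur = segment_shares(reference), segment_shares(current)
    r_ref, r_cur = positive_rates(reference), positive_rates(current)
    x_changed = any(abs(p_ref.get(s, 0.0) - p_cur.get(s, 0.0)) > tol
                    for s in set(p_ref) | set(p_cur))
    # P(Y|X) can only be compared on segments observed in both periods.
    y_given_x_changed = any(abs(r_ref[s] - r_cur[s]) > tol
                            for s in set(r_ref) & set(r_cur))
    if x_changed and y_given_x_changed:
        return "data drift and concept drift"
    if x_changed:
        return "data drift"
    if y_given_x_changed:
        return "concept drift"
    return "no drift detected"

def make_rows(spec):
    """Build (segment, label) rows from {segment: (row_count, positive_count)}."""
    rows = []
    for seg, (n, n_pos) in spec.items():
        rows += [(seg, 1)] * n_pos + [(seg, 0)] * (n - n_pos)
    return rows

# Reference period: 80% segment A (10% positive), 20% segment B (50% positive).
reference = make_rows({"A": (800, 80), "B": (200, 100)})
# Segment mix changes, positive rates per segment stay the same.
mix_changed = make_rows({"A": (400, 40), "B": (600, 300)})
# Segment mix stays the same, positive rate in segment A rises to 40%.
rule_changed = make_rows({"A": (800, 320), "B": (200, 100)})

print(diagnose(reference, mix_changed))   # data drift
print(diagnose(reference, rule_changed))  # concept drift
```

The comparison of P(Y|X) requires fresh labels, which often arrive late or not at all in production. That delay is one reason input-only checks for data drift are usually the first line of monitoring.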
Other Drift Types and Related Issues
Data drift and real concept drift are not the only types of concept drift. Some non-drift issues also resemble concept drift in their effects and mechanisms. Depending on your situation, they can affect your model just as strongly, so it is useful to keep them in mind.
Here is a list of the most important ones:
- Prior Probability Shift: Prior probability shift means a change in the class priors, that is, in the output’s statistical distribution: Pt1 (Y) ≠ Pt2 (Y). Other common terms for prior probability shift are class probability shift, label drift, real concept shift, and class drift.
- Novel Class Appearance: This is the case when, in a classification problem, your model needs to predict a label that it hasn’t seen before: Pt1 (Y = y) = 0 and Pt2 (Y = y) > 0. New unseen labels can be a typical result of a change in upstream data collection (e.g., a new field in a form) introducing a new value.
- Subconcept drift (intersected drift) and Full-concept drift (severe drift): Within concept drift, we can further differentiate between subconcept and full-concept drift depending on whether the drift affects the whole domain of X or just a section of it.
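The first two items in this list can be checked directly on label samples from two periods. The sketch below is one possible approach, not a standard recipe: it measures prior probability shift as the total variation distance between the two label distributions and reports novel classes as labels that appear only in the newer sample. The function names and the example labels are assumptions.

```python
from collections import Counter

def label_distribution(labels):
    """Estimate P(Y) as relative label frequencies."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def prior_shift(reference_labels, current_labels):
    """Total variation distance between two label distributions (0 = same, 1 = disjoint)."""
    p = label_distribution(reference_labels)
    q = label_distribution(current_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def novel_classes(reference_labels, current_labels):
    """Labels with zero frequency in the reference period and nonzero frequency now."""
    return set(current_labels) - set(reference_labels)

# Hypothetical labels: payment method chosen by a customer.
reference_labels = ["card"] * 70 + ["cash"] * 30
current_labels = ["card"] * 50 + ["cash"] * 30 + ["voucher"] * 20

shift = prior_shift(reference_labels, current_labels)
new_labels = novel_classes(reference_labels, current_labels)

print(round(shift, 2), new_labels)  # 0.2 {'voucher'}
```

A novel class always produces some prior probability shift as well, since probability mass has to move to the new label.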
Monitor Data Drift and Concept Drift in Your Machine Learning Workflow
In this article, you learned what data drift and concept drift are, their differences, and the main reasons behind them.
To maintain the performance of your models, you need to keep data and concept drift from degrading them. To do that, monitor your models so you can identify drift early.
As there are different drift types, you need to implement monitoring functions for each. You can do this manually, or you can use a monitoring framework such as Deepchecks.
Deepchecks provides a framework for monitoring ML models in production. It helps you detect model drift so you can handle it before it degrades the performance of your models.