How to Measure Model Drift

Introduction

Successful artificial intelligence (AI) deployments require continuous model drift detection and monitoring so that models are revalidated on an ongoing basis. Techniques for dealing with machine learning (ML) model drift must determine whether drift is present and how it affects the model’s performance, and the data scientist’s essential task is choosing drift metrics that suit the problem at hand.

The most common cause of model drift is a change in data distribution: the real-time production data diverges from a baseline data set, usually the training set, as the real world changes over time. This article presents several techniques for measuring that divergence. The simplest approaches compare the means or standard deviations of the two datasets, but they are only reliable for normally distributed samples (see figure below). If the data is not normally distributed, such metrics can give misleading information about model drift.

Difference of means (Source)

Model drift metrics for categorical features

For categorical features, drift metrics can measure the distance between the discrete empirical distributions, each defined by the probabilities of the categorical values. One group of metrics in this category derives the drift from norm distances, such as the Frobenius norm (for a vector, this is the Euclidean norm):

>>> import numpy as np
>>> from scipy.linalg import norm
>>> a = np.arange(9) - 4.0
>>> a
array([-4., -3., -2., -1.,  0.,  1.,  2.,  3.,  4.])
>>> norm(a)
7.745966692414834

Frobenius norm (Source)

Functions for computing this and several other norms are available in NumPy (numpy.linalg.norm) and SciPy (scipy.linalg.norm, used above).
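To connect the norm to drift on a categorical feature, the following minimal sketch computes the norm of the difference between two category-probability vectors. The helper category_probabilities, the color categories, and the sample counts are hypothetical and chosen only for illustration.

```python
import numpy as np

def category_probabilities(values, categories):
    """Return the empirical probability of each category, in a fixed order."""
    values = np.asarray(values)
    counts = np.array([(values == c).sum() for c in categories], dtype=float)
    return counts / counts.sum()

# Hypothetical categorical feature observed at training time and in production
categories = ["red", "green", "blue"]
baseline = ["red"] * 50 + ["green"] * 30 + ["blue"] * 20
production = ["red"] * 30 + ["green"] * 30 + ["blue"] * 40

p = category_probabilities(baseline, categories)    # [0.5, 0.3, 0.2]
q = category_probabilities(production, categories)  # [0.3, 0.3, 0.4]

l2_drift = np.linalg.norm(p - q)         # Euclidean (L2) distance
l1_drift = np.linalg.norm(p - q, ord=1)  # L1 distance (sum of absolute differences)

print(f"L2 drift: {l2_drift:.4f}, L1 drift: {l1_drift:.4f}")
```

Both values are 0 when the category proportions are identical and grow as the proportions move apart. Which norm and which alert threshold to use is a choice that depends on the application.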

Model drift metrics for numerical features

Another drift metric, applicable only to numerical features, is the Wasserstein distance, also known as the “Earth mover’s distance” (EMD). It measures the effort required to turn one distribution into another. The input distributions can be empirical, that is, given as samples whose values are passed directly to the function, or they can be seen as generalized functions, that is, weighted sums of point masses located at the specified values (the optional weight arguments in the second and third calls below). Below is an example:

>>> from scipy.stats import wasserstein_distance
>>> wasserstein_distance([0, 1, 3], [5, 6, 8])
5.0
>>> wasserstein_distance([0, 1], [0, 1], [3, 1], [2, 2])
0.25
>>> wasserstein_distance([3.4, 3.9, 7.5, 7.8], [4.5, 1.4],
                         [1.4, 0.9, 3.1, 7.2], [3.2, 3.5])
4.0781331438047861

EMD function (Source)

The resulting value of the EMD function indicates the difference between the two distributions. If the result is 0, then the distributions are the same. The higher the value, the more significant the difference between the two distributions is.

The function for determining EMD can be found at scipy.
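In a drift-monitoring setting, the two inputs would be the baseline values and the production values of one numerical feature. The sketch below uses synthetic normal samples as stand-ins; the sample sizes, the seed, and the 0.5 shift are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(seed=0)

# Synthetic stand-ins for one numerical feature
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)  # e.g. training data
no_drift = rng.normal(loc=0.0, scale=1.0, size=5000)  # production, same distribution
shifted = rng.normal(loc=0.5, scale=1.0, size=5000)   # production, mean shifted by 0.5

emd_same = wasserstein_distance(baseline, no_drift)
emd_shifted = wasserstein_distance(baseline, shifted)

print(f"EMD without drift: {emd_same:.3f}")
print(f"EMD with shifted mean: {emd_shifted:.3f}")
```

Note that the EMD is expressed in the units of the feature itself, so values are not directly comparable across features with different scales unless the features are standardized first.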

The next model drift metric is the Kolmogorov-Smirnov (K-S) test. The K-S test determines whether an empirical distribution conforms to a theoretical distribution (one-sample test) or whether two data samples differ significantly (two-sample test). The two-sample test is well suited to comparing datasets because it is sensitive to differences in both the location and the shape of the samples’ empirical cumulative distribution functions. Its null hypothesis is that the two samples come from the same distribution. The test reports a statistic, the largest distance between the two cumulative distribution functions, together with a p-value: the probability of observing a distance at least that large if the null hypothesis were true. If the p-value is less than 0.05, you can reject the null hypothesis and treat the two samples as significantly different, which indicates drift.

Suppose we want to test whether a dataset is distributed according to the standard normal distribution. We choose a confidence level of 95%, which means we reject the null hypothesis if the p-value is less than 0.05. In the example below, the sample is drawn from a uniform distribution, the p-value is far below 0.05, and the null hypothesis is rejected. We can conclude that the data are not distributed according to the standard normal.

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> stats.kstest(stats.uniform.rvs(size=100, random_state=rng),
...              stats.norm.cdf)
KstestResult(statistic=0.5001899973268688, pvalue=1.1616392184763533e-23)

KS-test (Source)

The function for performing K-S test can be found at scipy.
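The example above is a one-sample test against a theoretical distribution. For drift detection, the two-sample variant, which compares baseline data with production data directly, is usually the relevant one. A minimal sketch with scipy.stats.ks_2samp follows; the synthetic samples, the seed, and the 0.5 shift are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Synthetic stand-ins for one numerical feature
baseline = rng.normal(loc=0.0, scale=1.0, size=2000)    # e.g. training data
production = rng.normal(loc=0.5, scale=1.0, size=2000)  # production data, shifted mean

result = ks_2samp(baseline, production)

# Reject the null hypothesis (same distribution) at the 5% significance level
drift_detected = result.pvalue < 0.05

print(f"K-S statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3g}")
print("Drift detected" if drift_detected else "No significant drift")
```

With large samples, the test can flag very small differences as statistically significant, so it may be worth reading the statistic itself alongside the p-value.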

Model drift metrics for numerical or categorical features

Population stability index (PSI) is another drift metric that can be used for either numerical or categorical features. It is widely used in the financial industry. Numerical distributions are first converted into histograms using a suitable binning technique; there are several binning methods, and each can produce a different PSI value. PSI ranges from 0 to infinity and equals 0 when the two distributions are identical, so its value reflects the relative size of the drift.

PSI is calculated as:

PSI = Σ_i (Q(x_i) - P(x_i)) · ln(Q(x_i) / P(x_i))

where P(x_i) and Q(x_i) are the proportions of the two datasets that fall into bin (or category) i, and the sum runs over all bins.

Results are commonly interpreted as:

  • PSI < 0.1: two compared distributions are considered similar,
  • 0.1 ≤ PSI < 0.2: two compared distributions are moderately different,
  • PSI ≥ 0.2: two compared distributions are significantly different.

In our example below, the data is grouped into equal-width bins. The two histograms (right) resemble discretized versions of the respective distributions (left).

The example of two distributions (left) and respective histograms (right) (Source)

In the example presented in the figure above, the calculated PSI value is 0.153, which falls in the moderate range: the population may be shifting, and we should keep monitoring it.
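A PSI calculation with equal-width bins can be sketched as follows: bin both samples on shared edges, convert counts to proportions, and sum (Q - P) · ln(Q / P) over the bins. The function name population_stability_index, the default of 10 bins, and the small eps used to avoid taking the logarithm of zero for empty bins are all assumptions of this sketch, not a standard API.

```python
import numpy as np

def population_stability_index(baseline, production, bins=10, eps=1e-6):
    """PSI between two numerical samples using shared equal-width bins."""
    baseline = np.asarray(baseline, dtype=float)
    production = np.asarray(production, dtype=float)

    # Equal-width bin edges that span both samples
    lo = min(baseline.min(), production.min())
    hi = max(baseline.max(), production.max())
    edges = np.linspace(lo, hi, bins + 1)

    # Proportion of each sample falling into each bin
    p = np.histogram(baseline, bins=edges)[0] / baseline.size
    q = np.histogram(production, bins=edges)[0] / production.size

    # Assumption: clip empty bins to a small value so the logarithm is defined
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)

    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(seed=0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)
no_drift = rng.normal(loc=0.0, scale=1.0, size=5000)
shifted = rng.normal(loc=0.5, scale=1.0, size=5000)

psi_same = population_stability_index(baseline, no_drift)
psi_shifted = population_stability_index(baseline, shifted)

print(f"PSI without drift: {psi_same:.4f}")
print(f"PSI with shifted mean: {psi_shifted:.4f}")
```

Changing the number of bins or switching to quantile-based bins will change the resulting values, which is why the binning method should be fixed before comparing PSI values against thresholds over time.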

Kullback-Leibler (KL) Divergence is the next statistical model drift metric for numerical and categorical features. It measures the divergence between two probability distributions and is also known as relative entropy. KL divergence is useful if one distribution has a high variance or small sample size relative to the other. Like PSI, it yields a number in a range from 0 to infinity, and the 0 score also tells us that the distributions are identical. Unlike PSI, it is not symmetric, which means you will get different results by swapping reference and sample distributions.

KL divergence (Source)

So if we have the distributions Q(x) and P(x) of two datasets, the KL divergence of P from Q can be calculated by:

KL(P || Q) = Σ_x P(x) · ln(P(x) / Q(x))

Swapping P and Q gives KL(Q || P), which is generally a different value.

The function for calculating KL divergence can be found at scipy.
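As a minimal sketch, scipy.stats.entropy returns the KL divergence when it is given two distributions. The two probability vectors below are hypothetical bin or category proportions chosen for illustration.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical bin or category proportions for two datasets
p = np.array([0.5, 0.3, 0.2])  # reference (e.g. training) distribution
q = np.array([0.3, 0.3, 0.4])  # production distribution

# entropy(pk, qk) computes sum(pk * ln(pk / qk)), i.e. KL(P || Q)
kl_pq = entropy(p, q)
kl_qp = entropy(q, p)

print(f"KL(P || Q) = {kl_pq:.4f}")
print(f"KL(Q || P) = {kl_qp:.4f}")  # differs: KL divergence is not symmetric
```

If any entry of the second distribution is zero where the first is not, the result is infinite, so empty bins usually need smoothing or merging before the calculation.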

Jensen-Shannon (JS) divergence can also be used for numerical and categorical features. It is another way to measure the difference between two probability distributions, and its square root is often referred to as the Jensen-Shannon distance. When computed with a base-2 logarithm, the metric returns a value between zero and one: zero means that the distributions are identical, and one means that they are entirely different. This metric is calculated by:

JS(P || Q) = 1/2 · KL(P || M) + 1/2 · KL(Q || M)

where M is a mixture of the two distributions and is calculated as:

M = 1/2 · (P + Q)

A function for calculating the JS distance (the square root of the JS divergence) can be found at scipy (scipy.spatial.distance.jensenshannon).
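The sketch below uses scipy.spatial.distance.jensenshannon, which returns the JS distance, and squares the result to obtain the JS divergence. The probability vectors are hypothetical, and base=2 is chosen so that the result lies between zero and one.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical bin or category proportions for two datasets
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.4])

# jensenshannon returns the JS *distance*; base=2 bounds it to [0, 1]
js_distance = jensenshannon(p, q, base=2)
js_divergence = js_distance ** 2

# Two distributions with no overlap reach the upper bound
js_disjoint = jensenshannon([1.0, 0.0], [0.0, 1.0], base=2)

print(f"JS distance: {js_distance:.4f}, JS divergence: {js_divergence:.4f}")
print(f"JS distance for disjoint distributions: {js_disjoint:.4f}")
```

Unlike KL divergence, the result stays finite when one distribution has empty bins, because each distribution is compared with the mixture M rather than directly with the other.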

Numerous platforms provide services for detecting model drift in ML applications.

The enterprise solution IBM Watson Studio takes an integrated approach to data and ML model deployment. It offers continuous tracking of drift metrics, raises alerts when drift occurs, and helps minimize the effect of model degradation. Its drift monitor examines the behavior of an ML model and builds its own model to predict whether the monitored model generates an accurate prediction for a given data point. The drift detection algorithm then processes the data to identify the number of records for which the model makes inaccurate predictions and produces a predicted accuracy.

RapidMiner is an open-source data science platform that supports many kinds of analytics users across the AI lifecycle. Its tools help sustain model quality and value by automatically monitoring for drift, model performance degradation, and service health. The platform also provides tools for academics looking for an end-to-end data science platform for instructional or research purposes.

Massive Online Analysis (MOA) is another open-source tool for mining data streams, and it includes ML algorithms for model drift detection. It consists of a collection of offline and online techniques for ML model evaluation.

Conclusion

Many drift metrics can help you identify and quantify model drift. Each has different applications, interpretations, and origins that may be relevant to your model. Drift metrics are often approximations, especially those for numerical or continuous features: they may make assumptions in order to remain computable and require parameters to be specified. Be careful when choosing the features on which to measure drift, and check how significant each feature is for model performance. Measuring drift only on features that are unimportant or irrelevant to model performance can give misleading measurements and cause real drift to be overlooked.
