How to Choose the Right Metrics to Analyze Model Data Drift


Introduction

Machine learning models are trained and fine-tuned on the data available for a given task. That dataset, and the input features derived from it, can change unexpectedly over time, sometimes with drastic consequences: a housing price prediction model that is never retrained, for example, will give increasingly poor results. Successful deployment and continued use of a model therefore require continuous monitoring for drift. Data drift refers to shifts in the distribution of the underlying features that make a model trained on earlier data unreliable or outright unusable. To monitor drift successfully, we need an appropriate, applicable metric for the features our model consumes. In this blog, we will go over how to choose such metrics for drift monitoring. Although we focus on data drift, we must also stay conscious of concept drift: the unforeseen change in the statistical properties of the target variable that renders the learned feature–target relationship invalid.

The Problem Statement

The first step is to understand the issue at hand. Suppose we have two data samples, F1 and F2, drawn from the same underlying population. We train our model on F1 and its labels, and let O1 and O2 denote the model's predictions on F1 and F2 respectively. Our objective is to ascertain whether a model trained on F1 will still perform well on F2.

In this setup, the data in F2 may have changed; even though both samples come from the same underlying population, F2 may, for example, have been sampled at a later time. We need an appropriate way to represent such a change. Identifying it matters because our model learned to make predictions from a specific distribution of incoming features – if that distribution shifts, the model's predictions are no longer reliable. To avoid the curse of dimensionality, we examine individual features, or small subsets of features, one at a time. The figure below demonstrates such a shift in the distribution of a single feature between F1 and F2: despite belonging to the same underlying population, the two samples show clear differences in distribution.

A drift metric, then, is a function that takes the values of a feature from two different datasets (representing the same underlying population) and returns a quantitative measure of the difference between them. For example, customer preferences for specific ads, sampled at different time intervals, will reveal differences even for the very same set of customers. In drift detection, a larger value usually indicates a greater difference between the two distributions.
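As a minimal sketch of this idea, the snippet below simulates two samples of a single numerical feature taken at different times (the second with a slightly shifted mean and wider spread) and computes a toy drift score from their summary statistics; the drift_score helper is an illustrative assumption, not part of any particular library.

import numpy as np

rng = np.random.default_rng(42)

# Same underlying feature sampled at two different times;
# the later sample has drifted slightly (shifted mean, wider spread).
f1 = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time sample
f2 = rng.normal(loc=0.4, scale=1.2, size=5_000)   # later production sample

def drift_score(reference, current):
    """Toy drift score: absolute difference of means in units of the
    reference standard deviation. Real drift metrics (KL, JS, PSI, ...)
    compare full distributions, not just a single summary statistic."""
    return abs(np.mean(current) - np.mean(reference)) / np.std(reference)

print(f"drift score: {drift_score(f1, f2):.3f}")  # larger value => larger shift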


Types of Drift Metrics

After establishing our problem statement and a definition of drift, we examine the types of drift metrics and what distinguishes each one. Machine learning models are function approximators that, in most cases, try to map input variables to output variables. Essentially, we make predictions based on the assumption that the input and output data follow a certain statistical distribution. Since such distributions exist, we can measure deviations from the previously seen input feature distributions (in our example, F1 and F2). To measure such deviations, there are primarily four types of drift metrics, which we cover below:

  • Kullback–Leibler (KL) Divergence
    To understand KL divergence, let's consider two probability distributions A and B. In a typical statistical setting, A represents the actual data or observations (like ground truth in machine learning) and B represents a model of them. To calculate drift, A is taken to be the training-set feature distribution and B the distribution of the same features drawn from the same population at some later point t + delta (where t is the time the training data was sampled). KL divergence is then interpreted as the average number of extra bits (i.e., space required to encode the data, calculated with a base-2 logarithm) needed to encode samples of B using a code optimized for A. Note that A and B are not interchangeable: KL divergence is asymmetric, so the divergence of B from A generally differs from the divergence of A from B, and the choice of reference distribution matters.
    The most common implementation of KL divergence works on numerical features, represented as sets of samples, binned densities, or even cumulative densities. Binned densities refer to grouping continuous data points into discrete ranges to reduce complexity. An important property of this metric is that the actual values attained by the feature (the range of the input values) do not affect the computation – only the probability distributions do. A single-feature version can be implemented with this code:
import numpy as np
from scipy.special import rel_entr

# A single-feature implementation: F_1 and F_2 are the binned probability
# distributions of the same feature in the two samples (each sums to 1).
F_1 = [0.25, 0.33, 0.23, 0.19]
F_2 = [0.21, 0.21, 0.32, 0.26]

def kl_divergence(a, b):
    # sum_i a_i * log(a_i / b_i), using the natural logarithm (nats)
    return sum(a[i] * np.log(a[i] / b[i]) for i in range(len(a)))

print('KL-divergence(F_1 || F_2): %.3f' % kl_divergence(F_1, F_2))
# SciPy's rel_entr computes the same elementwise terms:
print('KL-divergence(F_1 || F_2): %.3f' % sum(rel_entr(F_1, F_2)))

# => KL-divergence(F_1 || F_2): 0.057
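Since the interpretation above is stated in bits, the divergence can also be computed with a base-2 logarithm; this is a small hedged variation on the snippet above rather than a separate standard implementation.

import numpy as np

F_1 = [0.25, 0.33, 0.23, 0.19]
F_2 = [0.21, 0.21, 0.32, 0.26]

# Same formula as above, but with log base 2 so the result is in bits
kl_bits = sum(p * np.log2(p / q) for p, q in zip(F_1, F_2))
print('KL-divergence in bits: %.3f' % kl_bits)
# => roughly 0.083 (the nats value 0.057 divided by ln 2)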
  • Jensen–Shannon (JS) Divergence
    JS divergence, also known as the information radius (IRad), is a measure of similarity between two probability distributions. Although based on the Kullback–Leibler (KL) divergence, it has some significant differences: JS divergence is symmetric and always takes a finite value. The square root of this metric is called the Jensen–Shannon distance.
    The grounds for using JS divergence are very similar to those for KL divergence: it is used for numerical and categorical features. The notable difference is that whereas KL divergence is asymmetric and can be infinite, JS divergence (computed with a base-2 logarithm) always lies in the range 0 – 1 and is symmetric. It can be calculated directly with the Python SciPy library, as in the following code snippet:
from scipy.spatial import distance

# jensenshannon returns the JS *distance* (the square root of the divergence);
# the third argument is the logarithm base (2.0 keeps the result in [0, 1]).
print(distance.jensenshannon([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 2.0))

# => 1.0 (the two distributions share no mass, i.e., maximal drift)
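To illustrate the symmetry and boundedness mentioned above, here is a small sketch; the example distributions are made up for demonstration.

from scipy.spatial import distance

p = [0.25, 0.33, 0.23, 0.19]
q = [0.21, 0.21, 0.32, 0.26]

# Symmetric: swapping the arguments gives the same value
print(distance.jensenshannon(p, q, 2.0))
print(distance.jensenshannon(q, p, 2.0))  # identical to the line above

# Bounded: even completely disjoint distributions never exceed 1
print(distance.jensenshannon([1, 0], [0, 1], 2.0))  # => 1.0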
  • Population Stability Index (PSI)
    PSI compares the distribution of a variable – a feature or a model score – on a new (e.g., validation or production) dataset against its distribution on the reference (training) dataset. Fundamentally, we want to check how the actual (current) values compare with the expected (reference) values. After splitting the variable into buckets, it is computed as:

    PSI = Σ (Actual% − Expected%) × ln(Actual% / Expected%)

    where Expected% and Actual% are the fractions of observations falling into each bucket in the reference and current datasets, respectively.

Since it works on bucketed distributions, PSI can be used with most kinds of features (e.g., numerical, categorical). A typical use case is during deployment, when we expect minor shifts in the data distribution but need to watch for substantial changes (e.g., in a social media feed). Since the magnitude of PSI tells us how large the shift is, we can set up simple thresholds to decide what is needed to keep the model relevant. One common interpretation is:

– PSI < 0.1: no significant population change

– PSI < 0.2: moderate population change

– PSI >= 0.2: significant population change

PSI takes non-negative values: 0 indicates identical distributions, and larger values indicate larger shifts.

The code snippet below calculates PSI from two given distributions. We pass in two samples of the same feature (represented by expected and actual), compute breakpoints (where we split the distribution into buckets or bins), and then compare the resulting histogram of the current data against the histogram of the data the model was trained on.

import numpy as np

def calculate_psi(expected, actual, buckettype='quantiles', buckets=10):
    """PSI between a reference sample (expected) and a current sample (actual)."""

    def scale_range(values, new_min, new_max):
        # Rescale the evenly spaced breakpoints onto the range of the reference data
        values = values - np.min(values)
        values = values / np.max(values) * (new_max - new_min)
        return values + new_min

    # Breakpoints: equal-width bins over the reference range, or reference quantiles
    breakpoints = np.arange(0, buckets + 1) / buckets * 100

    if buckettype == 'bins':
        breakpoints = scale_range(breakpoints, np.min(expected), np.max(expected))
    elif buckettype == 'quantiles':
        breakpoints = np.array([np.percentile(expected, b) for b in breakpoints])

    # Fraction of observations falling into each bucket
    expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)

    def sub_psi(e_perc, a_perc):
        # Guard against empty buckets so the logarithm stays defined
        if a_perc == 0:
            a_perc = 0.0001
        if e_perc == 0:
            e_perc = 0.0001
        return (e_perc - a_perc) * np.log(e_perc / a_perc)

    return sum(sub_psi(e, a)
               for e, a in zip(expected_percents, actual_percents))


expected = np.random.randn(100)  # reference (training-time) sample
actual = np.random.randn(100)    # current (production) sample

print(calculate_psi(expected, actual))

## Output (varies between runs since the inputs are random), e.g. => 0.3357

Code credits: Kaggle (adapted)
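Tying this back to the thresholds listed above, here is a minimal sketch of how a PSI value might be turned into an action; the interpret_psi helper and its messages are illustrative assumptions rather than a standard API, and the snippet reuses calculate_psi, expected, and actual from the code above.

def interpret_psi(psi_value):
    """Map a PSI value to the rule-of-thumb interpretation given above."""
    if psi_value < 0.1:
        return "no significant population change"
    if psi_value < 0.2:
        return "moderate population change - keep an eye on this feature"
    return "significant population change - consider retraining the model"

# reuses calculate_psi, expected and actual defined in the snippet above
psi_value = calculate_psi(expected, actual)
print(f"PSI = {psi_value:.3f}: {interpret_psi(psi_value)}")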

  • Chi-Squared Test
    The chi-squared test is a statistical hypothesis test that can be used when the test statistic is chi-squared distributed under the null hypothesis. It determines whether there is a statistically significant difference between the expected frequencies and the observed frequencies.
    Because the test relies on frequency counts, it is best suited to categorical features (or numerical features that have been binned into categories), and its outcome comes with statistical confidence in the form of a p-value. It can be calculated using the Python SciPy module with the following code.
from scipy.stats import chisquare

# Observed bucket counts vs. the counts expected from the reference distribution
print(chisquare([16, 18, 16, 14, 12, 12], f_exp=[16, 16, 16, 16, 16, 8]))

# the first component is the statistic (Pearson's cumulative test statistic)
# the second component is the p-value
# => Power_divergenceResult(statistic=3.5, pvalue=0.6233876277495822)
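As a hedged illustration of how the p-value might be used in a drift check, the snippet below flags drift when the p-value falls below a chosen significance level; the 0.05 threshold and the bucket counts are assumptions for the example, not values prescribed by the test itself.

from scipy.stats import chisquare

observed = [16, 18, 16, 14, 12, 12]   # bucket counts in the current data
expected = [16, 16, 16, 16, 16, 8]    # bucket counts implied by the training data
alpha = 0.05                          # chosen significance level (assumed)

statistic, p_value = chisquare(observed, f_exp=expected)
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: distributions differ significantly (possible drift)")
else:
    print(f"p = {p_value:.3f} >= {alpha}: no statistically significant drift detected")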

Conclusion

A wide variety of metrics can help us detect and quantify model data drift, manage it, and minimize its impact on the downstream consumers of a machine learning model's predictions. Which metric to use depends on the statistical nature of the problem at hand, the distribution the features follow, and the type of feature we are dealing with. The interpretations of these metrics also vary, and knowing where each algorithm comes from helps in understanding its output. Some of these metrics (PSI, the chi-squared test) do not capture drift in rare categories or small gradual drift, and some cannot detect very small changes, which may be critical in certain domains (e.g., healthcare). It is therefore often best to use an ensemble of approaches, as sketched below, so that data drift can be detected and interpreted more reliably.
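As a closing illustration, here is a minimal, hypothetical sketch of such an ensemble: it runs several of the metrics discussed above on the same pair of samples and raises a flag whenever one of them crosses its threshold. The threshold values and the drift_report structure are illustrative choices, not recommendations.

import numpy as np
from scipy.spatial import distance
from scipy.stats import chisquare

def drift_report(expected, actual, buckets=10):
    """Run several drift metrics on one numerical feature and flag any alarm."""
    # Shared binning based on the reference (expected) sample
    edges = np.quantile(expected, np.linspace(0, 1, buckets + 1))
    exp_counts = np.histogram(expected, edges)[0]
    act_counts = np.histogram(actual, edges)[0]
    exp_pct = np.clip(exp_counts / len(expected), 1e-4, None)
    act_pct = np.clip(act_counts / len(actual), 1e-4, None)

    psi = float(np.sum((exp_pct - act_pct) * np.log(exp_pct / act_pct)))
    js = float(distance.jensenshannon(exp_pct, act_pct, 2.0))
    # Scale expected counts so both frequency vectors have the same total
    chi_p = float(chisquare(act_counts,
                            f_exp=exp_counts * act_counts.sum() / exp_counts.sum()).pvalue)

    flags = {
        "psi": psi >= 0.2,     # assumed threshold from the PSI rule of thumb above
        "js": js >= 0.2,       # assumed threshold for illustration
        "chi2": chi_p < 0.05,  # assumed significance level
    }
    return {"psi": psi, "js_distance": js, "chi2_pvalue": chi_p, "drift_flags": flags}

rng = np.random.default_rng(0)
report = drift_report(rng.normal(0, 1, 5_000), rng.normal(0.5, 1, 5_000))
print(report)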
