## Introduction

Data drift is a common problem in machine learning that can significantly degrade your model's performance. The key to addressing it is the ability to detect and monitor drift over time.

There are several methods for tracking data drift in real-world applications. One popular method uses statistical tests to identify and monitor drift over time. However, traditional drift detection methods often require access to the entire original dataset, which may not be practical for large systems due to storage limits or privacy issues. An alternative is to sample the data first, but this approach can miss critical details, like rare events or outliers, making the results less reliable.

Another strategy involves profiling your data before applying drift detection algorithms. Data profiling involves summarizing your data’s key statistical characteristics, such as its distribution, common values, and missing information. This summary can then help adapt drift detection techniques to the data’s statistical properties. However, profiling is an estimation technique, and using it means accepting approximations in drift detection values compared to using the full dataset.

But what do we lose by profiling data before using drift detection algorithms, and how does it affect the results?

Understanding the trade-offs involved in profiling data is important to appreciate the differences in drift detection methods.

In this blog, we’ll focus on the essence of the Kolmogorov-Smirnov test, its application in detecting data drift, and how it compares with other normality tests, such as the Shapiro-Wilk test.

## Understanding the Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (KS) test is a non-parametric test that evaluates the difference between two cumulative distribution functions (CDFs). A CDF describes the probability that a random variable takes on a value less than or equal to a given value; in other words, for any value, it tells you the proportion of data points that fall at or below that value. In the context of the KS test, comparing two CDFs lets you see how the probabilities of one dataset stack up against another across the entire range of the data. This comparison helps identify whether the two datasets come from the same distribution or differ significantly.

The KS test can be applied in two main scenarios: comparing a sample against a known distribution (one-sample KS test), for example to test for normality, or comparing two samples to each other (two-sample KS test) to detect data drift or distributional changes. The essence of the KS test lies in its KS statistic, which quantifies the maximum distance between the CDFs, providing a measure of the divergence between the distributions.

One advantage of the KS test is that it's non-parametric: it doesn't assume the data come from any specific type of distribution. This is helpful because, in real situations, we often don't know the exact distribution of the data we're working with, and unlike parametric tests, the KS test does not require the data to adhere to any predefined distributional form.

The key part of the KS test is the KS statistic, denoted with D, which quantifies the maximum difference between the CDFs of two datasets. This statistic is used to assess the similarity between the two distributions being compared. The calculation of D helps determine whether the two datasets are likely to have come from the same distribution.

The formula for calculating the KS statistic D is as follows:

$$D = \sup_x \left| F_1(x) - F_2(x) \right|$$

Here, $F_1(x)$ and $F_2(x)$ represent the CDFs of the first and second datasets, respectively, at any given point *x*. The absolute value of the difference between these two functions is taken at each point across the entire range of data. The maximum (supremum) of these absolute differences is what constitutes the KS statistic, D.

In simpler terms, the KS statistic measures the biggest gap between the CDFs of the two groups of data we’re comparing. Here, we should also introduce an empirical CDF. The empirical CDF starts at 0 and increases to 1 as you move through the data points, with steps upward at each unique data value. The size of each step corresponds to the proportion of the dataset that has that specific value. In graphical terms, the empirical CDF gives a visual representation of the distribution of the data, showing how the total proportion of observations increases with the variable of interest.
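The step-function construction above can be turned directly into code. The sketch below (illustrative synthetic data, NumPy and SciPy assumed available) evaluates both empirical CDFs on the pooled sorted sample, takes the largest absolute gap as D, and checks that this agrees with SciPy's own two-sample statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=300)
b = rng.normal(0.5, 1.0, size=300)

# Evaluate both empirical CDFs at every pooled data point:
# ecdf(x) = fraction of the sample that is <= x.
grid = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
ecdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)

# The KS statistic D is the largest absolute gap between the two ECDFs.
d_manual = np.max(np.abs(ecdf_a - ecdf_b))

# SciPy's two-sample test computes the same statistic.
d_scipy, _ = stats.ks_2samp(a, b)
print(d_manual, d_scipy)
```

Because the ECDFs only change at observed data points, evaluating them on the pooled sample is enough to find the supremum exactly.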

The KS test can also be applied to verify if a sample comes from a specified distribution, such as normal, exponential, or any other defined theoretical distribution. This application is known as a goodness-of-fit test. In machine learning and data science, the KS test is used to detect data drift over time. By comparing distributions of datasets at different time points, one can identify significant shifts that may impact model performance.
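As a rough sketch of the drift-detection use case, the snippet below compares a hypothetical "reference" window of a feature (as seen at training time) against a later "production" window whose mean has shifted; the window sizes, means, and shift are made up for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
# Reference window: feature values observed at training time (simulated).
reference = rng.normal(loc=50.0, scale=5.0, size=2000)
# Production window: the same feature after an upward shift in the mean.
production = rng.normal(loc=52.0, scale=5.0, size=2000)

# Two-sample KS test between the two time windows.
d, p = ks_2samp(reference, production)
if p < 0.05:
    print(f"drift detected: D={d:.3f}, p={p:.2e}")
else:
    print("no significant drift")
```

In a monitoring pipeline this comparison would typically run per feature on a schedule, with the reference window fixed and the production window sliding forward in time.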

## The KS Test for Normality

The KS test for normality represents a specific application of the KS test that focuses on determining whether a given dataset follows a normal (Gaussian) distribution. This aspect of the KS test is particularly relevant in statistical analyses where normality is a key assumption for many conventional parametric tests. By comparing the sample distribution to a normal distribution, the KS test provides insights into whether the data deviates significantly from normality, guiding the choice of further statistical methods or transformations.

The test calculates the maximum difference between these two CDFs as its statistic. If this difference is small, it suggests that the sample distribution closely resembles a normal distribution. Conversely, a large KS statistic indicates a significant departure from normality.

Interpreting the results of the KS test involves examining the KS statistic and the associated p-value. A high KS statistic indicates a large difference between the distributions, while the p-value helps determine whether this difference is statistically significant. The p-value is the probability of observing data at least as extreme as the sample if the null hypothesis is true, and it thereby measures the strength of the evidence against the null hypothesis. In the context of data drift detection, a significant result suggests that the distribution of the data has changed, warranting further analysis or adaptation of the model.

### KS Limitations

However, the test does have its limitations. One is that the sensitivity of the KS test can vary with the sample size. With very large samples, even minor deviations from normality can result in the rejection of the null hypothesis (that the data are from a normal distribution). This sensitivity means that the KS test might flag datasets as non-normal even when those deviations are not practically significant. Additionally, the test is best suited for continuous data and might not be optimal for discrete data or data with many repeated values.
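This sample-size sensitivity is easy to demonstrate. The sketch below draws from a normal distribution whose mean is offset by a practically negligible 0.05 and tests it against the standard normal; with the illustrative sizes chosen here, the tiny offset typically goes unnoticed at n = 100 but is flagged as highly significant at n = 500,000 (exact p-values depend on the random draw):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
# A barely shifted normal: mean 0.05 instead of 0, unit variance.
small = rng.normal(0.05, 1.0, size=100)
large = rng.normal(0.05, 1.0, size=500_000)

# Test both samples against the standard normal distribution.
_, p_small = kstest(small, "norm")
_, p_large = kstest(large, "norm")
print(f"n=100:      p={p_small:.3f}")
print(f"n=500,000:  p={p_large:.2e}")  # the tiny shift becomes "significant"
```

The deviation is the same in both cases; only the test's ability to resolve it changes, which is why a significant result on a huge sample should be paired with a look at the effect size (D itself).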

Despite these limitations, the KS test for normality remains a popular choice for assessing distributional assumptions, particularly in exploratory data analysis phases or when evaluating the suitability of data for further parametric statistical tests.

### Python Implementation

Implementing the KS test in Python is straightforward, thanks to libraries such as SciPy. The `scipy.stats.kstest` function can be used for both one-sample and two-sample tests, allowing this method to be applied to datasets efficiently. This accessibility enables the integration of the KS test into data preprocessing pipelines, ensuring that data drift or distributional assumptions are validated with empirical evidence.

## Example

Imagine we have a large group of 1 million people, and we assume that their reaction times to a certain task are normally distributed across the whole group. We decide to test this assumption by taking a sample of 233 individuals from this population and recording their reaction times. When we plot these reaction times, we expect to see them forming a shape like a normal distribution. To verify this, we create a histogram of the reaction times we’ve recorded and overlay a normal distribution curve that matches the mean and standard deviation of our observed data. The resulting chart shows us how closely our sample matches a normal distribution.

In the chart, we notice that the actual distribution of our sample doesn’t match perfectly with the theoretical normal curve. We can quantify this discrepancy by calculating the proportion of the sample that falls outside the expected normal distribution areas – these are the parts of the histogram that don’t align with the normal curve, highlighted in red. This proportion represents our test statistic, a single number summarizing the difference between our sample and the expected normal distribution. It essentially tells us how much our observed data diverge from what we would expect if the null hypothesis (that the reaction times are normally distributed) were true.

If the null hypothesis is accurate, then the deviation should be relatively small, corresponding to a high probability (or p-value). Conversely, a large discrepancy suggests that it’s unlikely our sample comes from a normally distributed population, indicated by a low p-value. A common benchmark is to reject the null hypothesis if the p-value is less than 0.05.

Therefore, if our p-value is below 0.05, we conclude that it’s unlikely our variable (in this case, reaction time) is normally distributed in the population based on our sample.
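The reaction-time example might look like the following in code. The data here are simulated rather than real measurements, and there is one caveat worth noting: estimating the mean and standard deviation from the same sample makes the standard KS p-value only approximate (the Lilliefors correction exists for exactly this situation).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated reaction times (seconds) for a sample of 233 people.
sample = rng.normal(loc=0.45, scale=0.08, size=233)

# Fit a normal distribution to the sample, then test the fit.
# Caveat: parameters estimated from the same data make p approximate.
mu, sigma = sample.mean(), sample.std(ddof=1)
d, p = stats.kstest(sample, "norm", args=(mu, sigma))

if p < 0.05:
    print(f"reject normality: D={d:.3f}, p={p:.3f}")
else:
    print(f"no evidence against normality: D={d:.3f}, p={p:.3f}")
```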

## Comparison with Shapiro-Wilk Test

Given the importance of selecting the right statistical test for assessing data normality, it’s valuable to compare the KS test, known for its versatility, with the Shapiro-Wilk test, which is renowned for its accuracy in testing normality. While the KS test offers a general approach for comparing two samples or a sample against a reference distribution, the Shapiro-Wilk test is specifically designed to test the normality of a sample.

The Shapiro-Wilk test focuses on how well the data conform to a normal distribution by comparing the order statistics (sorted data values) to the expected values of the normal distribution. It calculates a test statistic that represents the correlation between the data and the corresponding normal scores. A higher test statistic value indicates that the data are more likely to be normally distributed. The Shapiro-Wilk test is known for its sensitivity, especially in small to moderate-sized samples, making it a preferred choice for normality testing when sample sizes are small.

The Shapiro-Wilk test is better suited for small sample sizes (less than 50 samples) but can also accommodate larger samples. On the other hand, the KS test is recommended for use with sample sizes of 50 or more. For both tests, the null hypothesis assumes that the data come from a population that is normally distributed.

One major difference between the two tests lies in their statistical power and sensitivity to different aspects of the data distribution. The Shapiro-Wilk test is more sensitive to deviations from normality, particularly in the tails of the distribution, whereas the KS test evaluates only the largest single difference across the entire distribution. This means that for datasets where the concern is extreme values affecting normality, the Shapiro-Wilk test may provide a more accurate assessment.
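To see the difference in practice, the sketch below runs both tests on a small, clearly skewed (lognormal) sample; the data and sample size are illustrative. Shapiro-Wilk typically rejects normality decisively here, while the KS test with parameters estimated from the sample tends to be more lenient:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# A small, right-skewed sample: lognormal data clearly violates normality.
skewed = rng.lognormal(mean=0.0, sigma=0.75, size=40)

# Shapiro-Wilk: designed specifically for normality, strong at small n.
w, p_shapiro = stats.shapiro(skewed)

# One-sample KS against a normal fitted to the same data.
mu, sigma = skewed.mean(), skewed.std(ddof=1)
d, p_ks = stats.kstest(skewed, "norm", args=(mu, sigma))

print(f"Shapiro-Wilk: W={w:.3f}, p={p_shapiro:.4f}")
print(f"KS test:      D={d:.3f}, p={p_ks:.4f}")
```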

## Conclusion

Understanding and detecting data drift is essential for maintaining the accuracy of machine learning models over time. First, deepen your understanding and proficiency with the KS test, and incorporate it into your data analysis and monitoring routines to detect and address drift effectively. Second, continuously learn and share your experiences and insights with the broader community. By documenting case studies, challenges, and solutions encountered while applying the KS test in various scenarios, you contribute to a knowledge base that can guide best practices and innovative applications. Whether used alone or alongside tests like Shapiro-Wilk, the KS test provides insights that support informed decision-making as your data evolves.