DEEPCHECKS GLOSSARY

# Kolmogorov-Smirnov Test

## What is the Kolmogorov-Smirnov Test?

The Kolmogorov-Smirnov test (K-S test) is a nonparametric statistical test that compares the distributions of two sample datasets, or one sample dataset against a reference probability distribution.

The test measures the maximum distance between either:

1) The empirical distribution functions (EDFs) of the two samples.
2) The sample's empirical distribution function and the cumulative distribution function (CDF) of a reference distribution.

This method is frequently used to determine whether two datasets differ significantly, or whether a dataset follows a specific distribution. Its applications extend across disciplines such as finance, physics, and environmental science, wherever robust data-distribution analysis is needed.
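As a concrete sketch of the two-sample case (assuming Python with NumPy and SciPy installed; the distributions and sample sizes here are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_a = rng.normal(loc=0.0, scale=1.0, size=500)  # draws from N(0, 1)
sample_b = rng.normal(loc=1.0, scale=1.0, size=500)  # draws from N(1, 1): shifted mean

# ks_2samp reports D, the maximum distance between the two EDFs, and a p-value.
result = stats.ks_2samp(sample_a, sample_b)
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.4g}")
```

Because the two samples come from distributions with different means, the test should report a large D and a small p-value.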

## Kolmogorov-Smirnov test in R

To run a Kolmogorov-Smirnov test in R, use the `ks.test()` function from the built-in stats package. It takes two principal arguments: either the data vectors of both samples, or one data vector together with the name of a cumulative distribution function to compare against. The output includes the test statistic and the p-value, which together determine whether to reject the null hypothesis that the two distributions are identical.
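For readers working in Python rather than R, SciPy provides analogous functions; a minimal sketch (assuming NumPy and SciPy are installed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=200)

# One-sample form: compare the sample's EDF against the standard normal CDF,
# mirroring R's ks.test(data, "pnorm").
one_sample = stats.kstest(data, "norm")

# Two-sample form, mirroring R's ks.test(x, y).
other = rng.uniform(2.0, 5.0, size=200)
two_sample = stats.ks_2samp(data, other)

print(f"one-sample: D = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.3f}")
print(f"two-sample: D = {two_sample.statistic:.3f}, p = {two_sample.pvalue:.3g}")
```

In both libraries the returned pair is the D statistic and its p-value, interpreted the same way.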

### How to run a Kolmogorov-Smirnov Test?

• Choose Datasets: Select the datasets or distributions that match your comparative goal. For example, you might compare the height distributions of two populations, or check whether an observed dataset follows a normal distribution. Ensure your selections align directly with the hypothesis under examination.
• Formulate the Hypotheses: Clearly define the null hypothesis (typically that the two distributions under comparison are identical) and the alternative hypothesis (that they differ). Articulating these before conducting the test is essential; they guide the interpretation of the results.
• Calculate the Empirical Distribution Functions (EDFs): Process each dataset independently. The EDF assigns to every value the proportion of data points in that dataset that are less than or equal to it, which requires sorting the dataset and computing a cumulative frequency for each value.
• Find the Maximum Distance: Identify the largest vertical gap, at any point across the range, between the two EDFs, or between an EDF and a reference distribution’s cumulative distribution function (CDF). This maximum distance, denoted D, is the test statistic; a larger value of D provides stronger evidence against the null hypothesis, suggesting a significant disparity between the distributions.
• Determine the Significance Level: The significance level, denoted alpha (α), is the threshold for deciding whether observed differences between distributions are statistically significant. Typical choices are 0.05 or 0.01, corresponding to a 5% or 1% chance of rejecting the null hypothesis when it is in fact true. A lower α reduces the probability of a Type I error (false positive) but increases the risk of a Type II error (false negative); the choice depends on the context of the test and how much risk of an incorrect conclusion can be tolerated.
• Compare with the Critical Value or Use the P-Value: Compare the calculated D statistic with the critical value of the K-S test at your chosen significance level; if D exceeds the critical value, reject the null hypothesis. Alternatively, compute the p-value associated with D: a p-value less than or equal to α indicates a significant difference between the distributions and warrants rejecting the null hypothesis.
• Draw a Conclusion: Evaluate the results by comparing the D statistic and p-value to your predetermined significance level (α). If the D statistic exceeds its critical value, or the p-value falls below α, reject the null hypothesis: the distributions differ significantly. Otherwise the null hypothesis cannot be rejected, which suggests no significant difference between the distributions. This final step confirms or refutes your initial hypothesis about the similarity of the distributions.
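The steps above can be sketched directly in code (a minimal implementation assuming Python with NumPy; the 1.358 coefficient is the standard large-sample critical value for α = 0.05 in the two-sample test):

```python
import numpy as np

def ks_two_sample(x, y):
    """Return the two-sample K-S statistic D: the largest vertical
    gap between the empirical distribution functions of x and y."""
    x, y = np.sort(x), np.sort(y)
    n, m = len(x), len(y)
    # Evaluate both EDFs at every observed value (the supremum is attained there).
    grid = np.concatenate([x, y])
    edf_x = np.searchsorted(x, grid, side="right") / n
    edf_y = np.searchsorted(y, grid, side="right") / m
    return np.max(np.abs(edf_x - edf_y))

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(loc=0.4, size=300)

d = ks_two_sample(a, b)
# Decision rule: reject at level alpha if D exceeds c(alpha) * sqrt((n + m) / (n * m)),
# where c(0.05) is approximately 1.358.
critical = 1.358 * np.sqrt((300 + 300) / (300 * 300))
print(f"D = {d:.3f}, critical value at alpha=0.05 = {critical:.3f}")
```

SciPy's `ks_2samp` computes the same D statistic along with an exact or asymptotic p-value, so in practice the library call is preferable to a hand-rolled version.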

## Kolmogorov-Smirnov test of normality

Lastly, let’s touch on the Kolmogorov-Smirnov test of normality, a specific application of the K-S test that assesses whether a sample originates from a normally distributed population. It compares the empirical distribution function of the sample data with the cumulative distribution function of a normal distribution, using the maximum difference between these two functions as the test statistic. A significant difference suggests a non-normal distribution of the sample data. The test is valid even for small sample sizes, but note that it assumes the mean and variance of the reference normal distribution are specified in advance; when these parameters are estimated from the sample itself, the standard p-values become too lenient, and a corrected variant such as the Lilliefors test should be used instead.
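A brief sketch of the normality check (Python with SciPy assumed; the datasets are synthetic and purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_data = rng.normal(loc=10.0, scale=2.0, size=150)
skewed_data = rng.exponential(scale=2.0, size=150)

# Test each sample against a fully specified N(10, 2) reference distribution.
res_normal = stats.kstest(normal_data, "norm", args=(10.0, 2.0))
res_skewed = stats.kstest(skewed_data, "norm", args=(10.0, 2.0))
print(f"normal sample: D = {res_normal.statistic:.3f}, p = {res_normal.pvalue:.3f}")
print(f"skewed sample: D = {res_skewed.statistic:.3f}, p = {res_skewed.pvalue:.3g}")

# Caveat: if the reference mean and standard deviation were instead estimated
# from the sample itself, the standard K-S p-value would be too lenient; a
# corrected variant such as the Lilliefors test is the appropriate choice there.
```

The sample drawn from the matching normal distribution should yield a small D, while the skewed exponential sample should yield a large D and a tiny p-value.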