Efficient and Stable Multi-Dimensional Kolmogorov-Smirnov Distance

Peter Matthew Jacobs, Foad Namjoo, Jeff M. Phillips

TL;DR

This work introduces the multidimensional Kolmogorov-Smirnov distance (dKS), defined via suprema over dominating rectangles, as a unit-invariant integral probability metric suitable for comparing distributions in $\mathbb{R}^d$. It establishes dKS as a proper metric on probability measures, derives sample complexity bounds, and presents near-linear time algorithms for computing $\textsf{dKS}$ when $d\in\{2,3,4\}$, enabling delta-precision two-sample tests. The paper also contrasts dKS with existing higher-dimensional KS variants, highlighting stability advantages and providing hardness results suggesting limits to runtime improvements for higher dimensions. Practical testing procedures are given, with explicit precision controls and runtimes, accompanied by empirical demonstrations in $d=2$. The framework paves the way for robust, unit-invariant, high-dimensional distribution comparison and hypothesis testing with scalable computation.

Abstract

We revisit extending the Kolmogorov-Smirnov distance between probability distributions to the multidimensional setting and make new arguments about the proper way to approach this generalization. Our proposed formulation maximizes the difference over orthogonal dominating rectangular ranges ($d$-sided rectangles in $\mathbb{R}^d$), and is an integral probability metric. We also prove that the distance between a distribution and a sample from the distribution converges to 0 as the sample size grows, and bound this rate. Moreover, we show that one can, up to this same approximation error, compute the distance efficiently in 4 or fewer dimensions; specifically the runtime is near-linear in the size of the sample needed for that error. With this, we derive a delta-precision two-sample hypothesis test using this distance. Finally, we show these metric and approximation properties do not hold for other popular variants.
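To make the definition concrete, the following is a minimal brute-force sketch for two finite samples. It is not the paper's near-linear algorithm. It assumes the dominating rectangle for a corner $z$ is $R_z = \{x : x_j \le z_j \text{ for all } j\}$ (the direction of the inequality is an assumption here), and the function name `dks_bruteforce` is hypothetical. Since the empirical difference only changes when a coordinate of $z$ crosses a sample coordinate, it suffices to try corners on the grid of observed coordinate values.

```python
from itertools import product


def dks_bruteforce(P, Q):
    """Brute-force empirical dKS between two non-empty point lists in R^d.

    Assumes R_z = {x : x_j <= z_j for all j}. The number of candidate
    corners is up to (|P| + |Q|)^d, so this is for illustration only.
    """
    P, Q = list(P), list(Q)
    d = len(P[0])
    # Candidate corner coordinates per axis: every value seen in either sample.
    coords = [sorted({x[j] for x in P + Q}) for j in range(d)]

    def frac_dominated(S, z):
        # Fraction of points of S that lie inside the dominating rectangle R_z.
        return sum(all(x[j] <= z[j] for j in range(d)) for x in S) / len(S)

    best = 0.0
    for z in product(*coords):
        best = max(best, abs(frac_dominated(P, z) - frac_dominated(Q, z)))
    return best
```

For example, `dks_bruteforce([(0, 1)], [(1, 0)])` evaluates four candidate corners; the corner `(0, 1)` contains the first point but not the second, so the two single-point samples are at distance 1.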

Paper Structure

This paper contains 21 sections, 13 theorems, 19 equations, 6 figures, 1 table.

Key Result

Theorem 1

For a distribution $\mu$ on $\mathbb{R}^d$, sampling $n = O((1/\varepsilon^2)(d + \log(1/\delta)))$ points $P \sim \mu$ will, with probability at least $1-\delta$, have
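As a rough illustration of how the sample size in this bound scales, the sketch below evaluates $(d + \log(1/\delta))/\varepsilon^2$ directly. The constant hidden by the $O(\cdot)$ is not given here, so the multiplier `C` (default 1) and the function name `sample_size` are assumptions for illustration only; the returned numbers show the growth rate, not a guaranteed sample size.

```python
import math


def sample_size(eps, delta, d, C=1.0):
    """Evaluate C * (d + ln(1/delta)) / eps^2, rounded up.

    C stands in for the unspecified constant inside the O(.) bound;
    C = 1 is an arbitrary placeholder, not a value from the paper.
    """
    if not (0 < eps < 1 and 0 < delta < 1 and d >= 1):
        raise ValueError("expected 0 < eps < 1, 0 < delta < 1, d >= 1")
    return math.ceil(C * (d + math.log(1.0 / delta)) / eps ** 2)
```

Under these assumptions, halving `eps` roughly quadruples the result, while shrinking `delta` or increasing `d` only adds a term to the numerator.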

Figures (6)

  • Figure 1: $\textsf{dKS}(P,Q)$ uses maximizing $z \in \mathbb{R}^2$ between distributions $P$ (blue $\circ$) and $Q$ (red $\star$).
  • Figure 2: Effect of snapping a range defined by $z$ to one defined by $z'$.
  • Figure 3: Mapping to Klee's problem in $d=2$: $x \in R_z$ (green) iff $z \in r_x$ (red rectangle)
  • Figure 4: Lifting from intervals (1-d rectangles) $\mathcal{T}$ in $\mathbb{R}$ to dominating rectangles $\mathcal{R}$ in $\mathbb{R}^2$.
  • Figure 5: Hard example in $d=2$ (left) and $d=3$ (right) between distributions of blue $\circ$ and red $\star$.
  • ...and 1 more figure

Theorems & Definitions (13)

  • Theorem 1
  • Lemma 2
  • Lemma 3
  • Theorem 4
  • Corollary 5
  • Lemma 6: Chan (2013)
  • Theorem 7
  • Corollary 8
  • Corollary 9
  • Theorem 10
  • ...and 3 more