Integral Probability Metrics Meet Neural Networks: The Radon-Kolmogorov-Smirnov Test

Seunghoon Paik; Michael Celentano; Alden Green; Ryan J. Tibshirani

TL;DR

The paper introduces the Radon-Kolmogorov-Smirnov (RKS) test, a multivariate two-sample test based on the IPM over the unit ball of a Radon bounded variation (RBV) space of smoothness degree $k$. It proves a representer theorem showing that the witness of this IPM is a ridge spline, so that computing the IPM amounts to training a two-layer neural network under a path-norm constraint. It also derives the asymptotic null distribution, proves consistency, and reports performance competitive with kernel MMD in diverse settings. In practice, RKS uses neural-network optimization to approximate the IPM witness, supports permutation-based finite-sample calibration, and offers sensitivity to tail and anisotropic differences through higher orders $k$. RKS generalizes the KS test to multiple dimensions with tunable smoothness, complementing kernel-based tests for nonparametric two-sample problems, with potential benefits for anomaly detection, distribution-shift detection, and robust generative modeling.
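The permutation-calibrated workflow mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes the witness is a single neuron of the form $(w^\top x - b)_+^k$ with $\|w\|_2 = 1$ and $b \geq 0$, and it replaces the paper's neural-network optimization with a plain random search over candidate neurons, so the returned statistic is only a lower bound on the true maximum. The function names and default parameters are hypothetical.

```python
import numpy as np

def neuron_activations(X, W, b, k):
    """Return (w_j^T x_i - b_j)_+^k for every sample i and candidate neuron j."""
    Z = X @ W.T - b                      # shape (n_samples, n_candidates)
    if k == 0:
        return (Z >= 0).astype(float)    # degree 0: step function
    return np.maximum(Z, 0.0) ** k

def rks_permutation_test(X, Y, k=1, n_candidates=1000, n_perm=200, seed=0):
    """Random-search approximation of a single-neuron IPM, calibrated by permutation.

    Assumption: random search stands in for gradient-based training.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    m, N = len(X), len(pooled)

    # Candidate neurons: unit-norm directions w and offsets b >= 0.
    # They depend only on the pooled sample, so they are the same
    # for every relabeling of the data.
    W = rng.standard_normal((n_candidates, pooled.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    b_max = np.maximum((pooled @ W.T).max(axis=0), 0.0)
    b = rng.uniform(size=n_candidates) * b_max

    A = neuron_activations(pooled, W, b, k)  # computed once, reused below

    def statistic(idx_x, idx_y):
        # Largest absolute mean difference over the candidate neurons.
        return np.abs(A[idx_x].mean(axis=0) - A[idx_y].mean(axis=0)).max()

    observed = statistic(np.arange(m), np.arange(m, N))
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(N)
        exceed += statistic(perm[:m], perm[m:]) >= observed
    p_value = (1 + exceed) / (1 + n_perm)
    return float(observed), float(p_value)

# Usage: a mean shift in the first coordinate of a 2-d Gaussian.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
Y = rng.standard_normal((200, 2)) + np.array([1.0, 0.0])
stat, p = rks_permutation_test(X, Y, k=1)
print(f"statistic={stat:.3f}, permutation p-value={p:.3f}")
```

Fixing the candidate set before permuting keeps the statistic a fixed function of the labeled pooled sample, which is what the permutation argument requires.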

Abstract

Integral probability metrics (IPMs) constitute a general class of nonparametric two-sample tests that are based on maximizing the mean difference between samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the IPM defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness degree $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. We can thus leverage the power of modern neural network optimization toolkits to (approximately) maximize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test.
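In symbols, the construction described in the abstract can be written roughly as follows. The notation here is an assumption for illustration ($x_1,\dots,x_m$ drawn from $P$, $y_1,\dots,y_n$ drawn from $Q$, and $\rho$ for the distance), and any side conditions the paper places on $\mathcal{F}$ are omitted, so this is a reading aid rather than the paper's exact statement.

$$
\rho(P_m, Q_n; \mathcal{F}) = \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m f(x_i) - \frac{1}{n} \sum_{i=1}^n f(y_i) \right|, \qquad \mathcal{F} = \left\{ f : \| f \|_{\mathrm{RTV}^k} \leq 1 \right\}.
$$

If the supremum is attained by a single neuron $x \mapsto (w^\top x - b)_+^k$ with $\|w\|_2 = 1$ and $b \geq 0$, as the abstract states, then for $d = 1$ and $k = 0$ the candidates are step functions $1\{x \geq b\}$ and $1\{-x \geq b\}$, and the criterion compares empirical tail frequencies at a threshold. This is the sense in which the test reduces, up to conventions at ties, to the classical two-sample KS statistic.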

Paper Structure (51 sections, 26 theorems, 121 equations, 15 figures, 1 table, 1 algorithm)

Introduction
Summary of contributions.
Related work.
The Radon-Kolmogorov-Smirnov test
Functions of Radon bounded variation
Pointwise evaluation of RBV functions
The RKS distance is an IPM
The RKS distance identifies the null hypothesis
Asymptotics
Experiments
Computation of the RKS distance
Experimental setup
Experimental results
Aggregating RKS tests
Discussion
...and 36 more sections

Key Result

Proposition 1

Fix any $k \geq 0$. For each $\mathsf{f} \in \mathrm{RBV}^k$, there exists a representative $f \in \mathsf{f}$ which satisfies, for all $x \in \mathbb{R}^d$,

$$
f(x) = \int_{\mathbb{S}^{d-1} \times [0,\infty)} (w^\top x - b)_+^k \, d\mu(w, b) + q(x),
$$

for some finite signed Borel measure $\mu$ on $\mathbb{S}^{d-1} \times [0,\infty)$ satisfying $\| \mu \|_{\mathrm{TV}} = \| \mathsf{f} \|_{\mathrm{RTV}^k}$, and some polynomial $q$ of degree at most $k$, where we use $\| \cdot \|_{\mathrm{TV}}$ for the total variation norm.
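A small numerical check can make the role of the measure $\mu$ concrete. The sketch below assumes the representative is an integral of neurons $(w^\top x - b)_+^k$ against $\mu$ plus a polynomial $q$, takes $q = 0$, and uses an atomic measure $\mu = \sum_j c_j \delta_{(w_j, b_j)}$, so that $f$ is a finite-width network and $\|\mu\|_{\mathrm{TV}} = \sum_j |c_j|$. All names and sample sizes are hypothetical. It verifies that the mean difference of $f$ is bounded by the total variation norm times the best single-neuron mean difference, which is the triangle-inequality step behind a single neuron sufficing as witness.

```python
import numpy as np

def ridge_spline(X, W, b, c, k):
    """Evaluate f(x) = sum_j c_j (w_j^T x - b_j)_+^k (polynomial part taken as zero).

    Returns the function values and the per-neuron activations.
    """
    Z = X @ W.T - b
    act = (Z >= 0).astype(float) if k == 0 else np.maximum(Z, 0.0) ** k
    return act @ c, act

rng = np.random.default_rng(0)
d, k, n_atoms = 3, 2, 5

# Atoms of the measure: unit-norm directions, nonnegative offsets, signed weights.
W = rng.standard_normal((n_atoms, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
b = rng.uniform(0.0, 1.0, size=n_atoms)
c = rng.standard_normal(n_atoms)
tv_norm = np.abs(c).sum()            # total variation of the atomic measure

# Two samples that differ in scale.
X = rng.standard_normal((500, d))
Y = 1.5 * rng.standard_normal((500, d))

fX, actX = ridge_spline(X, W, b, c, k)
fY, actY = ridge_spline(Y, W, b, c, k)

mean_diff_f = abs(fX.mean() - fY.mean())
per_neuron = np.abs(actX.mean(axis=0) - actY.mean(axis=0))
bound = tv_norm * per_neuron.max()

print(f"|mean difference of f| = {mean_diff_f:.4f}")
print(f"TV norm * best neuron  = {bound:.4f}")
```

Dividing $f$ by `tv_norm` places it in the unit ball, so no unit-norm combination of these neurons can beat the best individual neuron.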

Figures (15)

Figure 1: Illustration of RKS tests for $P = {\mathcal{N}}_2(0,I)$ and $Q = {\mathcal{N}}_2(0,D)$, where $D = \text{diag}(1.4, 1)$.
Figure 2: Samples drawn from the settings in Table 1, with $d = 2$.
Figure 4: ROC curves across the experimental settings described in Table 1. Each row represents a different setting, and each column a different dimension. The RKS, kernel MMD ("KMMD") with a Gaussian kernel, and energy distance tests are compared. Each is coded by a combination of color and line type.
Figure 5: ROC curves across the same experimental settings as in Figure 4, along with ROC curves from the min aggregation rule ("agg-min") and Fisher aggregation rule ("agg-Fisher"). In each row, only the ROC curve from the best-performing RKS test is drawn in color, and the rest are drawn in gray.
Figure 6: IPM values obtained by optimizing the "log" (x-axis) and "no-log" (y-axis) problems. Each point represents a set of samples drawn from $P,Q$, and the result of running $T=1200$ iterations with learning rate $0.01$, for each criterion. (This learning rate was chosen to be favorable to the "no-log" problem.) Points below the diagonal mean that the "log" criterion results in a larger IPM value, which we see is especially prominent for larger $k,d$, and more prominent under the null.
...and 10 more figures

Theorems & Definitions (39)

Proposition 1: Adaptation of Theorem 22 in parhi2021banach; Theorem 3.8 in parhi2022ridge
Definition 1
Theorem 1
Theorem 2
Theorem 3
Corollary 1
proof
Theorem 4
Theorem 5
Theorem 6
...and 29 more
