Integral Probability Metrics Meet Neural Networks: The Radon-Kolmogorov-Smirnov Test

Seunghoon Paik, Michael Celentano, Alden Green, Ryan J. Tibshirani

TL;DR

The paper introduces the Radon-Kolmogorov-Smirnov (RKS) test, a multivariate, higher-smoothness IPM built from Radon bounded variation (RBV) spaces. It proves a representer theorem showing that the witness of the IPM is a ridge spline, so that optimizing the IPM amounts to training a two-layer neural network under a path-norm constraint. It also derives the asymptotic null distribution, establishes consistency, and reports performance competitive with kernel MMD across diverse settings. In practice, RKS uses neural-network optimization to approximate the IPM witness, supports permutation-based finite-sample calibration, and offers sensitivity to tail and anisotropy differences through the choice of smoothness degree $k$. RKS generalizes the KS test to multiple dimensions with tunable smoothness, complementing kernel-based tests for nonparametric two-sample problems, with potential applications in anomaly detection, distribution shift, and robust generative modeling.
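
The permutation-based calibration mentioned above can be sketched generically. This is a minimal illustration, not the paper's implementation: the function names `permutation_p_value` and `abs_mean_diff` are hypothetical, and the stand-in statistic is a plain absolute mean difference rather than the RKS statistic.

```python
import random


def permutation_p_value(xs, ys, statistic, num_perms=200, seed=0):
    """Permutation p-value for a two-sample statistic.

    `statistic(xs, ys)` is assumed to return a number where larger values
    mean more evidence against P = Q.
    """
    rng = random.Random(seed)
    observed = statistic(xs, ys)
    pooled = list(xs) + list(ys)
    m = len(xs)
    exceed = 0
    for _ in range(num_perms):
        # Reassign group labels at random by shuffling the pooled sample.
        rng.shuffle(pooled)
        if statistic(pooled[:m], pooled[m:]) >= observed:
            exceed += 1
    # The +1 terms count the observed labeling itself, so the p-value is never 0.
    return (exceed + 1) / (num_perms + 1)


def abs_mean_diff(xs, ys):
    # Stand-in statistic for illustration only (NOT the RKS statistic).
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))


# Toy 1-D data: ys is xs shifted by 5, so the null should be rejected.
xs = [0.1 * i for i in range(20)]
ys = [0.1 * i + 5.0 for i in range(20)]
p_shifted = permutation_p_value(xs, ys, abs_mean_diff)
p_same = permutation_p_value(xs, xs, abs_mean_diff)
```

Any two-sample statistic, including an approximate RKS value obtained from a neural-network optimizer, could be passed as `statistic`; the calibration logic does not depend on how the statistic is computed.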

Abstract

Integral probability metrics (IPMs) constitute a general class of nonparametric two-sample tests that are based on maximizing the mean difference between samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the IPM defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness degree $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. We can thus leverage the power of modern neural network optimization toolkits to (approximately) maximize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test.
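
Since the abstract states that the witness is a single ridge spline of degree $k$, the underlying criterion can be illustrated by searching over individual neurons. The sketch below assumes the ridge spline has the form $(w^\top x - b)_+^k$ with unit-norm $w$ and $b \geq 0$, and it swaps in plain random search for the neural-network optimization the paper uses, so it only gives a lower bound on the maximum. All function names are hypothetical.

```python
import math
import random


def ridge_spline(x, w, b, k):
    """Degree-k ridge spline (w^T x - b)_+^k; for k = 0, the indicator of w^T x >= b."""
    t = sum(wi * xi for wi, xi in zip(w, x)) - b
    if k == 0:
        return 1.0 if t >= 0 else 0.0
    return max(t, 0.0) ** k


def mean_difference(xs, ys, w, b, k):
    """Absolute difference of sample means of the ridge spline over xs and ys."""
    px = sum(ridge_spline(x, w, b, k) for x in xs) / len(xs)
    qy = sum(ridge_spline(y, w, b, k) for y in ys) / len(ys)
    return abs(px - qy)


def rks_random_search(xs, ys, k=1, num_candidates=500, seed=0):
    """Random search over (w, b); an assumption standing in for gradient-based training."""
    rng = random.Random(seed)
    d = len(xs[0])
    # Offsets beyond the largest sample norm give a zero mean difference for k >= 1.
    b_max = max(math.sqrt(sum(v * v for v in p)) for p in list(xs) + list(ys))
    best = (0.0, None, None)
    for _ in range(num_candidates):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        w = [v / norm for v in g]  # uniform direction on the unit sphere
        b = rng.uniform(0.0, b_max)
        val = mean_difference(xs, ys, w, b, k)
        if val > best[0]:
            best = (val, w, b)
    return best  # (statistic lower bound, direction w, offset b)


# Toy 2-D data (not from the paper): ys is xs shifted along the first axis.
data_rng = random.Random(1)
xs = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(50)]
ys = [[x[0] + 1.0, x[1]] for x in xs]
stat, w_best, b_best = rks_random_search(xs, ys, k=1)
```

Random search scales poorly with the dimension $d$, which is one reason a gradient-based optimizer is the more natural tool here; the sketch is only meant to make the single-neuron criterion concrete.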

Paper Structure (51 sections, 26 theorems, 121 equations, 15 figures, 1 table, 1 algorithm)

Key Result

Proposition 1

Fix any $k \geq 0$. For each $\mathsf{f} \in {\mathrm{RBV}}^k$, there exists a representative $f \in \mathsf{f}$ which satisfies, for all $x \in \mathbb{R}^d$,

$$f(x) = \int_{\mathbb{S}^{d-1}\times[0,\infty)} (w^\top x - b)_+^k \, d\mu(w,b) + q(x),$$

for some finite signed Borel measure $\mu$ on ${\mathbb{S}^{d-1}\times[0,\infty)}$ satisfying $\| \mu \|_{{\rm TV}} = \| \mathsf{f} \|_{{\mathrm{RTV}}^k}$, and some polynomial $q$ of degree at most $k$, where we use $\| \cdot \|_{{\rm TV}}$ for the total variation...
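
To make the link to two-layer networks concrete, consider the special case where $\mu$ is a finite sum of point masses, $\mu = \sum_j a_j \delta_{(w_j, b_j)}$. Assuming the representative is a ridge-spline integral against $\mu$ plus the polynomial $q$, it reduces to a finite-width network $\sum_j a_j (w_j^\top x - b_j)_+^k + q(x)$, and $\| \mu \|_{\rm TV}$ becomes $\sum_j |a_j|$. The sketch below evaluates such a function; the names and the toy atoms are hypothetical.

```python
def evaluate_representative(x, atoms, k, poly=lambda x: 0.0):
    """Evaluate f(x) = sum_j a_j (w_j^T x - b_j)_+^k + q(x).

    `atoms` is a list of (a_j, w_j, b_j) with unit-norm w_j and b_j >= 0,
    i.e. the point masses of a discrete measure mu (an illustrative special case).
    `poly` plays the role of the polynomial q and defaults to zero.
    """
    total = poly(x)
    for a, w, b in atoms:
        t = sum(wi * xi for wi, xi in zip(w, x)) - b
        if k == 0:
            act = 1.0 if t >= 0 else 0.0
        else:
            act = max(t, 0.0) ** k
        total += a * act
    return total


def total_variation(atoms):
    """TV norm of the discrete measure: sum of |a_j|, assuming distinct (w_j, b_j)."""
    return sum(abs(a) for a, _, _ in atoms)


# Two neurons with weights +0.5 and -0.5, so the measure has TV norm 1.
atoms = [(0.5, [1.0, 0.0], 0.0), (-0.5, [0.0, 1.0], 1.0)]
value = evaluate_representative([2.0, 0.0], atoms, k=1)
tv = total_variation(atoms)
```

Constraining `total_variation(atoms)` to be at most 1 is the discrete analogue of restricting to the unit ball, which is how a norm constraint on the measure turns into a constraint on the outer-layer weights of the network.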

Figures (15)

  • Figure 1: Illustration of RKS tests for $P = {\mathcal{N}}_2(0,I)$ and $Q = {\mathcal{N}}_2(0,D)$, where $D = \text{diag}(1.4, 1)$.
  • Figure 2: Samples drawn from the settings in Table 1, with $d = 2$.
  • Figure 4: ROC curves across the experimental settings described in Table 1. Each row represents a different setting, and each column a different dimension. The RKS, kernel MMD ("KMMD") with a Gaussian kernel, and energy distance tests are compared. Each is coded by a combination of color and line type.
  • Figure 5: ROC curves across the same experimental settings as in Figure 4, along with ROC curves from the min aggregation rule ("agg-min") and Fisher aggregation rule ("agg-Fisher"). In each row, only the ROC curve from the best-performing RKS test is drawn in color, and the rest are drawn in gray.
  • Figure 6: IPM values obtained by optimizing the "log" (x-axis) and "no-log" (y-axis) problems. Each point represents a set of samples drawn from $P,Q$, and the result of running $T=1200$ iterations with learning rate $0.01$, for each criterion. (This learning rate was chosen to be favorable to the "no-log" problem.) Points below the diagonal mean that the "log" criterion results in a larger IPM value, which we see is especially prominent for larger $k,d$, and more prominent under the null.
  • ...and 10 more figures

Theorems & Definitions (39)

  • Proposition 1: Adaptation of Theorem 22 in parhi2021banach; Theorem 3.8 in parhi2022ridge
  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • proof
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • ...and 29 more