Distribution-free two-sample testing with blurred total variation distance

Rohan Hore, Rina Foygel Barber

TL;DR

This work tackles the challenge of distribution-free two-sample testing by introducing blurred total variation, a smoothing-based relaxation of the classical TV distance. It provides distribution-free lower and upper confidence bounds for blurred TV, along with Monte Carlo estimators and bandwidth-adaptive schemes that maintain validity without distributional assumptions. A key insight is that inference quality depends on intrinsic rather than ambient dimension, enabling meaningful guarantees when data lie on or near a low-dimensional structure. The approach offers practical tools for hypothesis testing and model evaluation in high-dimensional nonparametric settings, with proofs relegated to the appendix. Overall, blurred TV serves as a principled, tractable surrogate for TV that preserves interpretability while enabling assumption-free inference.

Abstract

Two-sample testing, where we aim to determine whether two distributions are equal or not equal based on samples from each one, is challenging if we cannot place assumptions on the properties of the two distributions. In particular, certifying equality of distributions, or even providing a tight upper bound on the total variation (TV) distance between the distributions, is impossible to achieve in a distribution-free regime. In this work, we examine the blurred TV distance, a relaxation of TV distance that enables us to perform inference without assumptions on the distributions. We provide theoretical guarantees for distribution-free upper and lower bounds on the blurred TV distance, and examine its properties in high dimensions.
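
To make the idea of "blurring" concrete, here is a minimal sketch under a stated assumption: that the blurred TV distance at bandwidth $h$ is the TV distance between $P$ and $Q$ after each is convolved with a kernel rescaled by $h$. The abstract does not spell out the definition, so treat this as an illustration rather than the paper's formulation. With a Gaussian kernel and two Gaussians of equal variance, the convolution only inflates the variance, which gives a closed form. The function names are hypothetical.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tv_equal_variance_gaussians(mu_p: float, mu_q: float, sigma: float) -> float:
    """TV distance between N(mu_p, sigma^2) and N(mu_q, sigma^2).

    For equal variances the densities cross at the midpoint of the means,
    so TV = 2 * Phi(|mu_p - mu_q| / (2 * sigma)) - 1.
    """
    return 2.0 * std_normal_cdf(abs(mu_p - mu_q) / (2.0 * sigma)) - 1.0


def blurred_tv_gaussian(mu_p: float, mu_q: float, sigma: float, h: float) -> float:
    """Blurred TV under the ASSUMED definition TV(P * psi_h, Q * psi_h).

    Assumption: psi_h is the N(0, h^2) density, so convolving N(mu, sigma^2)
    with it yields N(mu, sigma^2 + h^2).
    """
    blurred_sigma = math.sqrt(sigma ** 2 + h ** 2)
    return tv_equal_variance_gaussians(mu_p, mu_q, blurred_sigma)


if __name__ == "__main__":
    # P = N(1, 1), Q = N(-1, 1); h = 0 recovers the unblurred TV distance.
    for h in (0.0, 0.5, 1.0, 2.0):
        print(f"h = {h:.1f}  blurred TV = {blurred_tv_gaussian(1.0, -1.0, 1.0, h):.4f}")
```

In this example the value at $h=0$ is $\Phi(1)-\Phi(-1)\approx 0.683$, and the blurred distance shrinks as $h$ grows, which is the sense in which blurring relaxes TV: more smoothing makes the two distributions harder to tell apart.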

Paper Structure (47 sections, 18 theorems, 222 equations, 4 figures)

Key Result

Theorem 1.1

Fix $\alpha\in[0,1]$, any $d\geq 1$, and any $n,m\geq 1$. Let $\hat{U}_\alpha$ be any (possibly randomized) distribution-free upper confidence bound for $\mathrm{d}_{\mathrm{TV}}(\cdot,\cdot)$. Then, for any pair of distributions $P,Q\in \mathcal{P}_d$ satisfying $\textnormal{atom}(P)\cap\textnormal

Figures (4)

  • Figure 1: Near-monotonic behavior of blurred TV as a function of the bandwidth $h$. Here $P=\mathcal{N}(1,1)$, $Q=\mathcal{N}(-1,1)$, with $\mathrm{d}_{\mathrm{TV}}(P,Q) = \Phi(1)-\Phi(-1) \approx 0.683$, marked with a $\star$ in the figure. The kernel $\psi$ is either the Gaussian kernel (left) or a multimodal kernel given by the density of the mixture distribution $\tfrac{1}{3}\,\mathcal{N}(-4,1)+\tfrac{1}{3}\,\mathcal{N}(0,1)+\tfrac{1}{3}\,\mathcal{N}(4,1)$ (right).
  • Figure 2: Monte Carlo based confidence bounds on $\mathrm{d}_{\mathrm{TV}}^h(P,Q)$. In each plot, $\mathrm{d}_{\mathrm{TV}}(P,Q)$ is marked by a $\star$ symbol. See Section \ref{sec:simulation} for simulation details.
  • Figure 3: A visualization of the results of Section \ref{sec:curse_of_dimensionality}, for distributions $P,Q$ with bounded density on the unit ball.
  • Figure 4: Effect of dimension on the empirical blurred TV $\mathrm{d}_{\mathrm{TV}}^h(\widehat{P}_n,\widehat{Q}_m)$. See Section \ref{sec:simulation_dimension} for simulation details.
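
The empirical quantity $\mathrm{d}_{\mathrm{TV}}^h(\widehat{P}_n,\widehat{Q}_m)$ named in the Figure 4 caption can be sketched in one dimension. This assumes, as an illustration only, that blurring an empirical distribution with a Gaussian kernel of bandwidth $h$ gives a Gaussian kernel density estimate, so the empirical blurred TV is half the $L_1$ distance between two such estimates. The function names, grid size, and padding are hypothetical choices; this is a plug-in estimate, not the confidence bounds of the paper, and it does not reproduce the paper's simulations.

```python
import numpy as np


def gaussian_kde_density(grid: np.ndarray, sample: np.ndarray, h: float) -> np.ndarray:
    """Density of the empirical distribution of `sample` convolved with N(0, h^2)."""
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2.0 * np.pi))


def empirical_blurred_tv_1d(x, y, h: float, grid_size: int = 2001, pad: float = 8.0) -> float:
    """Plug-in estimate 0.5 * integral |p_hat_h - q_hat_h| on a uniform grid.

    Assumptions: one-dimensional data, Gaussian kernel, and a grid fine enough
    relative to h (increase grid_size when h is small compared to the data range).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pad by several bandwidths so that the kernel tails are covered.
    lo = min(x.min(), y.min()) - pad * h
    hi = max(x.max(), y.max()) + pad * h
    grid = np.linspace(lo, hi, grid_size)
    dx = grid[1] - grid[0]
    diff = np.abs(gaussian_kde_density(grid, x, h) - gaussian_kde_density(grid, y, h))
    # Riemann sum; accurate when the integrand is smooth at the grid scale.
    return float(0.5 * diff.sum() * dx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(1.0, 1.0, size=500)   # draws from P = N(1, 1)
    y = rng.normal(-1.0, 1.0, size=500)  # draws from Q = N(-1, 1)
    for h in (0.25, 0.5, 1.0, 2.0):
        print(f"h = {h:.2f}  empirical blurred TV = {empirical_blurred_tv_1d(x, y, h):.3f}")
```

The grid-based integral is practical only in low dimension; in higher dimensions one would need Monte Carlo integration instead.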

Theorems & Definitions (30)

  • Definition 1: Distribution-free confidence bounds
  • Theorem 1.1
  • Definition 2: Blurred total variation distance
  • Proposition 2.1
  • Proposition 2.2
  • Theorem 2.3
  • Proposition 2.4
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • ...and 20 more