Thinning to improve two-sample discrepancy

Gleb Smirnov; Roman Vershynin

TL;DR

The paper studies how to align two independent samples drawn from the same distribution by discarding a small fraction of the points. It introduces an online thinning algorithm that alternates between points of the two samples and achieves the bound $\mathbb{E}D_{n,n} \le T\log_2^{2d} n$ while discarding at most $O(n/T)$ points, for $1 \le T \le \sqrt{n}$, compared with the classical $O(\sqrt{n})$ scale without thinning. The analysis first reduces to the uniform-marginals case via a perturbation and a randomized transform. It then uses a dyadic-box decomposition, lattice-box arguments, and Chernoff-type bounds on slice counts to extend the bound to all axis-aligned boxes, with near-linear time and space. A vector-balancing perspective yields an online scheme with dimension-free guarantees. The resulting thinning technique is online and distribution-free.

Abstract

The discrepancy between two independent samples $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ drawn from the same distribution on $\mathbb{R}^d$ typically has order $O(\sqrt{n})$ even in one dimension. We give a simple online algorithm that reduces the discrepancy to $O(\log^{2d} n)$ by discarding a small fraction of the points.
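To make the one-dimensional baseline concrete, the sketch below computes an interval discrepancy for two samples without any thinning. It is an illustration only, not the paper's algorithm: it assumes the discrepancy is the largest difference in point counts over all intervals, and that all points are distinct (as for a continuous distribution). The names `interval_discrepancy` and `demo` are hypothetical.

```python
import random


def interval_discrepancy(xs, ys):
    """Largest |#xs in I - #ys in I| over all intervals I.

    Assumes all points are distinct. Walking through the merged, sorted
    points with +1 for each x and -1 for each y, the count difference on
    an interval is a difference of two prefix sums, so the maximum over
    intervals is (max prefix sum) - (min prefix sum), with the empty
    prefix counted as 0.
    """
    tagged = sorted([(x, 1) for x in xs] + [(y, -1) for y in ys])
    walk = highest = lowest = 0
    for _, step in tagged:
        walk += step
        highest = max(highest, walk)
        lowest = min(lowest, walk)
    return highest - lowest


def demo(n=10_000, seed=0):
    """Compare the unthinned discrepancy of two uniform samples with sqrt(n)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]
    return interval_discrepancy(xs, ys), n ** 0.5


if __name__ == "__main__":
    disc, scale = demo()
    print(f"discrepancy = {disc}, sqrt(n) = {scale:.1f}")
```

Running `demo` for increasing `n` gives a rough feel for the $\sqrt{n}$ growth that thinning is meant to avoid.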
