Thinning to improve two-sample discrepancy
Gleb Smirnov, Roman Vershynin
TL;DR
The paper studies how to align two independent samples drawn from the same distribution by discarding a small fraction of the points. It introduces an online thinning algorithm that alternates between points of the two samples and achieves $\mathbb{E}D_{n,n} \le T\log_2^{2d} n$ while discarding at most $O(n/T)$ points, for $1 \le T \le \sqrt{n}$, improving on the classical $O(\sqrt{n})$ scale. The analysis first reduces to the uniform-marginals case via a perturbation and a randomized transform. It then uses a dyadic-box decomposition, lattice-box arguments, and Chernoff-type bounds on slice counts to extend the bound to all axis-aligned boxes, with near-linear time and space. A vector-balancing perspective yields an online scheme with dimension-free guarantees. The resulting method is a distribution-free, online thinning technique that substantially tightens two-sample discrepancy at modest computational cost.
Abstract
The discrepancy between two independent samples $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ drawn from the same distribution on $\mathbb{R}^d$ typically has order $O(\sqrt{n})$ even in one dimension. We give a simple online algorithm that reduces the discrepancy to $O(\log^{2d} n)$ by discarding a small fraction of the points.
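To make the quantity being reduced concrete, the following minimal sketch computes a one-dimensional two-sample discrepancy before any thinning. It assumes the discrepancy is the maximum over intervals of the absolute difference between the number of $X$ points and $Y$ points in the interval, and that all points are distinct. The names `interval_discrepancy` and `demo` are hypothetical, and the sketch does not implement the paper's thinning algorithm.

```python
import random


def interval_discrepancy(xs, ys):
    """Max over intervals I of |#{x in I} - #{y in I}|.

    Assumption: all points are distinct (true almost surely for
    continuous distributions), so tie handling is not addressed.
    """
    # Tag each point: +1 for the first sample, -1 for the second.
    tagged = sorted([(x, 1) for x in xs] + [(y, -1) for y in ys])
    running = 0
    lo = hi = 0  # extremes of the signed prefix count, including the empty prefix
    for _, sign in tagged:
        running += sign
        lo = min(lo, running)
        hi = max(hi, running)
    # The signed count of an interval is a difference of two prefix counts,
    # so the largest absolute value is the range of the prefix counts.
    return hi - lo


def demo(n=10_000, seed=0):
    """Discrepancy of two independent uniform samples of size n (illustrative)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]
    return interval_discrepancy(xs, ys)
```

Calling `demo(n)` for increasing `n` gives a rough sense of the unthinned baseline, which the abstract states is typically of order $O(\sqrt{n})$.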
