Thinning to improve two-sample discrepancy

Gleb Smirnov, Roman Vershynin

TL;DR

The paper addresses how to align two independent samples drawn from the same distribution by discarding a small fraction of the points. It introduces an online thinning algorithm that processes points from the two samples alternately and achieves $\mathbb{E}D_{n,n} \le T\log_2^{2d} n$ while discarding, on average, $O(n/T)$ points, for any $1 \le T \le \sqrt{n}$. This improves on the $O(\sqrt{n})$ discrepancy that two unthinned samples typically exhibit. The analysis first reduces to the uniform-marginals case via a perturbation and a randomized transform. It then uses a dyadic-box decomposition, lattice-box arguments, and Chernoff-type bounds on slice counts to extend the bound from dyadic boxes to all axis-aligned boxes. The algorithm runs in near-linear time and space. A vector-balancing perspective yields an online scheme with dimension-free guarantees. Overall, the method is a distribution-free, online thinning technique that substantially reduces two-sample discrepancy at near-linear computational cost.
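To illustrate the dyadic decomposition mentioned above, here is a minimal one-dimensional sketch. It shows the standard way to split an interval with integer grid endpoints into disjoint dyadic pieces; the function name `dyadic_cover` and the integer-grid setting are assumptions made for illustration, not the paper's construction. The number of pieces grows only logarithmically with the interval length, which is why a bound for dyadic boxes can be transferred to arbitrary boxes at the cost of logarithmic factors.

```python
def dyadic_cover(a, b):
    """Split [a, b), with integers 0 <= a <= b, into disjoint dyadic intervals.

    A dyadic interval here means [j * 2**k, (j + 1) * 2**k) for integers j, k >= 0.
    """
    pieces = []
    while a < b:
        # Largest power of two dividing a; if a == 0, start above b instead.
        size = (a & -a) if a else 1 << b.bit_length()
        # Shrink the block until it fits inside [a, b).
        while a + size > b:
            size >>= 1
        pieces.append((a, a + size))
        a += size
    return pieces


# Example: [3, 11) splits into four dyadic pieces.
print(dyadic_cover(3, 11))
```

In $d$ dimensions the same splitting can be applied to each coordinate separately, so an axis-aligned box on the grid becomes a union of products of dyadic intervals.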

Abstract

The discrepancy between two independent samples $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ drawn from the same distribution on $\mathbb{R}^d$ typically has order $O(\sqrt{n})$ even in one dimension. We give a simple online algorithm that reduces the discrepancy to $O(\log^{2d} n)$ by discarding a small fraction of the points.
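The $\sqrt{n}$ baseline in one dimension can be checked numerically. The sketch below assumes that the discrepancy is the largest difference, over all intervals, between the number of $X$ points and the number of $Y$ points in the interval; this definition is an assumption for illustration, and the helper names `interval_discrepancy` and `mean_discrepancy` are hypothetical. No thinning is performed here; the code only measures the unthinned baseline.

```python
import random


def interval_discrepancy(xs, ys):
    """Max over intervals I of |#{x in I} - #{y in I}| for distinct 1-D points."""
    # Tag X points with +1 and Y points with -1, then sort by position.
    tagged = sorted([(x, 1) for x in xs] + [(y, -1) for y in ys])
    prefix = lo = hi = 0
    for _, sign in tagged:
        prefix += sign
        lo = min(lo, prefix)
        hi = max(hi, prefix)
    # The signed count of an interval is a difference of two prefix sums,
    # so the largest absolute value equals the range of the prefix-sum walk.
    return hi - lo


def mean_discrepancy(n, trials=20, seed=0):
    """Average discrepancy of two uniform samples of size n (no thinning)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]
        total += interval_discrepancy(xs, ys)
    return total / trials


if __name__ == "__main__":
    for n in (1_000, 4_000, 16_000):
        print(n, mean_discrepancy(n))
```

If the $\sqrt{n}$ scaling holds, quadrupling $n$ should roughly double the printed average.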

Paper Structure

This paper contains 5 sections, 8 theorems, 24 equations.

Key Result

Theorem 1

Fix $T$ so that $1 \leqslant T \leqslant \sqrt{n}$. Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$ be i.i.d. samples from the same Borel probability distribution on $\mathbb{R}^d$. There is a randomized online algorithm that discards, on average, at most $Cn/T$ of the $X_i$'s and $Y_i$'s, and achieves $\mathbb{E} D_{n,n} \leqslant T \log_2^{2d} n$, where $C$ is an absolute constant. Expectations are over both samples and the algorithm.
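The parameter $T$ trades discarded points against discrepancy. The sketch below tabulates both sides of that trade-off using the discard bound $Cn/T$ from the theorem and the discrepancy bound $T\log_2^{2d} n$ quoted in the TL;DR. The source does not give the value of $C$, so `C=1.0` is a placeholder, and the function name `theorem1_tradeoff` is hypothetical.

```python
import math


def theorem1_tradeoff(n, d, T, C=1.0):
    """Return (discard bound, discrepancy bound) for sample size n and dimension d.

    C is the unspecified absolute constant; 1.0 is only a placeholder.
    """
    if not 1 <= T <= math.sqrt(n):
        raise ValueError("T must satisfy 1 <= T <= sqrt(n)")
    discard_bound = C * n / T                         # expected discarded points
    discrepancy_bound = T * math.log2(n) ** (2 * d)   # bound on E[D_{n,n}]
    return discard_bound, discrepancy_bound


if __name__ == "__main__":
    n, d = 2 ** 20, 1
    print("unthinned scale sqrt(n):", math.sqrt(n))
    for T in (1, 2, 4):
        print(T, theorem1_tradeoff(n, d, T))
```

Because of the polylogarithmic factor, the bound only falls below $\sqrt{n}$ once $n$ is large relative to $d$ and $T$, so the comparison is most informative for large samples and small $T$.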

Theorems & Definitions (11)

  • Theorem 1: Two-sample discrepancy
  • Proposition 1: Sign discrepancy
  • Remark 1: Time and memory cost
  • Remark 2: Open question
  • Remark 3: Dyadic boxes
  • Lemma 1
  • Lemma 2
  • Proposition 2
  • Lemma 3
  • Lemma 4
  • ...and 1 more