Compress Then Test: Powerful Kernel Testing in Near-linear Time

Carles Domingo-Enrich, Raaz Dwivedi, Lester Mackey

TL;DR

The paper addresses the computational bottleneck of kernel two-sample testing by introducing Compress Then Test (CTT), which compresses each input sample into a small coreset and then runs a permutation-based MMD test on the compressed data. It proves that, for standard kernels and subexponential distributions, CTT recovers the same optimal detection boundary as a quadratic-time test while running in near-linear time, and it justifies cheaper permutation testing with new power analyses. The authors further extend CTT with Low-Rank CTT (LR-CTT), which leverages low-rank kernel approximations, and Aggregated CTT (ACTT), which efficiently identifies especially discriminating kernels among multiple candidates. In experiments with real and simulated data, CTT and its extensions achieve 20–200x speed-ups over state-of-the-art approximate MMD tests with no loss of power, making powerful kernel testing scalable to large datasets.

Abstract

Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
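
The compress-then-test idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the Gaussian kernel with a fixed bandwidth, and the permutation of individual compressed points are hypothetical choices. In particular, `placeholder_compress` is plain uniform subsampling, not the paper's high-fidelity coreset construction, so this sketch shows only the shape of the pipeline and not the accuracy guarantees of CTT.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD; quadratic in sample size."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )


def placeholder_compress(points, size, rng):
    """ASSUMPTION: uniform subsampling stands in for the paper's compression step."""
    idx = rng.choice(len(points), size=size, replace=False)
    return points[idx]


def compress_then_test(x, y, coreset_size, num_permutations=199, alpha=0.05, seed=0):
    """Compress each sample, then run a permutation MMD test on the small sets."""
    rng = np.random.default_rng(seed)
    cx = placeholder_compress(x, coreset_size, rng)
    cy = placeholder_compress(y, coreset_size, rng)
    observed = mmd_squared(cx, cy)

    pooled = np.vstack([cx, cy])
    exceed = 0
    for _ in range(num_permutations):
        # Reassign the pooled compressed points to two groups at random.
        perm = rng.permutation(len(pooled))
        px = pooled[perm[: len(cx)]]
        py = pooled[perm[len(cx):]]
        if mmd_squared(px, py) >= observed:
            exceed += 1

    # Standard permutation p-value with the +1 correction.
    p_value = (1 + exceed) / (1 + num_permutations)
    return p_value <= alpha, p_value
```

Calling `compress_then_test(x, y, coreset_size=64)` on two arrays of shape `(n, d)` returns a reject decision and a p-value. The quadratic-cost MMD computation only ever touches `2 * coreset_size` points per permutation, which is where the runtime saving comes from.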

Paper Structure (43 sections, 24 theorems, 201 equations, 10 figures, 2 tables)

Key Result

lemma 1

The CoresetMMD estimate tmmd satisfies [...] with probability at least $1-\delta$ conditional on $(\mathbb{X}_{m},\mathbb{Y}_{n})$, and [...] with probability at least $1-3\delta$ for $c_{\delta} \triangleq 2+\sqrt{2\log(\frac{2}{\delta})}$. (Unless otherwise specified, all of our results refer to an arbitrary setting of an algorithm's input arguments.)
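
To get a feel for the constant $c_{\delta}$ in the lemma, the following small sketch evaluates it numerically. It assumes the natural logarithm and reads the definition as $c_{\delta} = 2+\sqrt{2\log(2/\delta)}$; the function name `c_delta` is a hypothetical choice for illustration.

```python
import math


def c_delta(delta: float) -> float:
    """Evaluate c_delta = 2 + sqrt(2 * log(2 / delta)), assuming the natural log."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return 2.0 + math.sqrt(2.0 * math.log(2.0 / delta))
```

For example, `c_delta(0.05)` is roughly 4.72. The constant grows only like the square root of $\log(1/\delta)$, so tightening the failure probability changes it slowly.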

Figures (10)

  • Figure 1: Time-power trade-off curves in the Gaussian and EMNIST experimental settings comparing (left) CTT to five state-of-the-art approximate MMD tests based on subsampling and (right) LR-CTT to the state-of-the-art low-rank MMD test based on random Fourier features (RFF).
  • Figure 2: Time-power trade-off curves for ACTT and the state-of-the-art incomplete MMD aggregation test in the Blobs and Higgs experimental settings.
  • Figure 3: KT-Compress -- Identify coreset of size $2^{\mathfrak{g}}\sqrt{n}$
  • Figure 4: OptHalve4 -- Optimal four-point halving
  • Figure 5: kt-split -- Divide points into candidate coresets of size $\lfloor n/2 \rfloor$
  • ...and 5 more figures
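
The coreset size quoted in the Figure 3 caption, $2^{\mathfrak{g}}\sqrt{n}$, can be tabulated with a short sketch. The function name and the requirement that $n$ be a perfect square are assumptions made here so that the size is an integer; they are not taken from the paper.

```python
import math


def kt_compress_coreset_size(n: int, g: int) -> int:
    """Coreset size 2^g * sqrt(n) from the Figure 3 caption.

    ASSUMPTION: n is a perfect square, so that sqrt(n) is an integer.
    """
    if n <= 0 or g < 0:
        raise ValueError("n must be positive and g must be non-negative")
    root = math.isqrt(n)
    if root * root != n:
        raise ValueError("this sketch assumes n is a perfect square")
    return 2**g * root
```

With `n = 16384` and `g = 2` this gives 512 points, so any quadratic-cost computation on the coreset touches about 512² pairs rather than 16384² pairs.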

Theorems & Definitions (43)

  • lemma 1: Quality of CoresetMMD
  • remark 1: Beyond i.i.d. data
  • remark 2: Beyond KT-Compress
  • proposition 1: Finite-sample exactness of CTT
  • remark 3: Exchangeability
  • theorem 1: Power of CTT
  • remark 4: Valid parameter values
  • proposition 2: Power upper bounds for complete, block, and incomplete MMD tests
  • theorem 2: LR-CTT exactness and power
  • theorem 3: ACTT validity and power
  • ...and 33 more