Compress Then Test: Powerful Kernel Testing in Near-linear Time

Carles Domingo-Enrich; Raaz Dwivedi; Lester Mackey

TL;DR

The paper addresses the computational bottleneck of kernel two-sample testing by introducing Compress Then Test (CTT), which compresses each input sample into a small coreset and then runs a permutation-based MMD test on the compressed data. For standard kernels and subexponential distributions, the authors prove that CTT recovers the detection boundary of a quadratic-time test while running in near-linear time, and they make permutation testing cheaper, with new power analyses to justify it. Two extensions follow: Low-Rank CTT (LR-CTT), which exploits low-rank kernel approximations, and Aggregated CTT (ACTT), which efficiently selects among multiple kernels. In experiments with real and simulated data, CTT, LR-CTT, and ACTT achieve 20–200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.

Abstract

Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
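The compress-then-test idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The compression step here is plain uniform subsampling, a hypothetical stand-in for the paper's coreset construction, and the permutation step shuffles individual compressed points, which is a simplification. The function names, the Gaussian kernel, and the bandwidth are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between the rows of a and the rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(K, is_x):
    # V-statistic estimate of MMD^2 from a joint kernel matrix and a boolean
    # mask marking which rows belong to the first sample.
    w = np.where(is_x, 1.0 / is_x.sum(), -1.0 / (~is_x).sum())
    return float(w @ K @ w)

def subsample_compress(points, size, rng):
    # ASSUMPTION: uniform subsampling stands in for the paper's coreset
    # construction; it does not carry the paper's fidelity guarantees.
    idx = rng.choice(len(points), size=size, replace=False)
    return points[idx]

def compress_then_test(x, y, coreset_size, n_perm=199, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: compress each sample separately.
    xc = subsample_compress(x, coreset_size, rng)
    yc = subsample_compress(y, coreset_size, rng)
    # Step 2: compute the kernel matrix once, on the compressed points only.
    z = np.vstack([xc, yc])
    K = gaussian_kernel(z, z)
    is_x = np.arange(len(z)) < len(xc)
    observed = mmd_squared(K, is_x)
    # Step 3: permutation test, reusing K and shuffling the sample labels.
    null = np.array([mmd_squared(K, rng.permutation(is_x)) for _ in range(n_perm)])
    p_value = float((1 + np.sum(null >= observed)) / (1 + n_perm))
    return p_value, bool(p_value <= alpha)
```

Calling `compress_then_test(x, y, coreset_size=32)` on two arrays of shape `(n, d)` returns a permutation p-value and a reject decision at level `alpha`. The kernel matrix is built only on the compressed points, so its cost depends on the coreset size rather than on $n$.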

Paper Structure (43 sections, 24 theorems, 201 equations, 10 figures, 2 tables)

Introduction
Kernel Two-sample Testing
Compress Then Test
MMD compression with CoresetMMD
Compress Then Test
CTT Extensions
Low-Rank CTT
Aggregated CTT
Experiments
Connections and Conclusions
Background on KT-Compress
Proof of \ref{thm:compression_guarantee}
On the KT-Compress error inflation factor
Proof of claim \eqref{eq:mmd_diff_x_y}
Proof of claim \eqref{eq:mmd_diff_p_q}
...and 28 more sections

Key Result

lemma 1

The CoresetMMD estimate satisfies [...] with probability at least $1-\delta$ conditional on $(\mathbb{X}_{m},\mathbb{Y}_{n})$, and [...] with probability at least $1-3\delta$, for $c_{\delta} \triangleq 2+\sqrt{2\log(\frac{2}{\delta})}$. (Unless otherwise specified, all of our results refer to an arbitrary setting of an algorithm's input arguments.)
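To get a feel for the constant $c_{\delta}$ defined in lemma 1, the following sketch evaluates $2+\sqrt{2\log(2/\delta)}$ for a few failure probabilities. The function name `c_delta` is a hypothetical helper, and the natural logarithm is assumed.

```python
import math

def c_delta(delta):
    # c_delta = 2 + sqrt(2 * log(2 / delta)), natural log assumed.
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return 2.0 + math.sqrt(2.0 * math.log(2.0 / delta))

for delta in (0.1, 0.05, 0.01):
    print(f"delta={delta}: c_delta={c_delta(delta):.3f}")
```

For $\delta = 0.05$ this gives roughly 4.72. The constant grows only like $\sqrt{\log(1/\delta)}$, so tightening the failure probability increases it slowly.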

Figures (10)

Figure 1: Time-power trade-off curves in the Gaussian and EMNIST experimental settings comparing (left) CTT to five state-of-the-art approximate MMD tests based on subsampling and (right) LR-CTT to the state-of-the-art low-rank MMD test based on random Fourier features (RFF).
Figure 2: Time-power trade-off curves for ACTT and the state-of-the-art incomplete MMD aggregation test in the Blobs and Higgs experimental settings.
Figure 3: KT-Compress -- Identify coreset of size $2^{\mathfrak{g}}\sqrt{n}$
Figure 4: OptHalve4 -- Optimal four-point halving
Figure 5: kt-split -- Divide points into candidate coresets of size $\lfloor n/2 \rfloor$
...and 5 more figures

Theorems & Definitions (43)

lemma 1: Quality of CoresetMMD
remark 1: Beyond i.i.d. data
remark 2: Beyond KT-Compress
proposition 1: Finite-sample exactness of CTT
remark 3: Exchangeability
theorem 1: Power of CTT
remark 4: Valid parameter values
proposition 2: Power upper bounds for complete, block, and incomplete MMD tests
theorem 2: LR-CTT exactness and power
theorem 3: ACTT validity and power
...and 33 more
