A Martingale Kernel Two-Sample Test

Anirban Chatterjee, Aaditya Ramdas

TL;DR

The paper introduces the martingale MMD (mMMD) test for two-sample problems, defining a quadratic-time statistic that, under the null, converges to a standard normal distribution without resampling. By constructing a martingale difference sequence through past observations to estimate the witness function, the authors achieve a calibration-free test with strong theoretical guarantees including asymptotic normality under varying kernels, dimensions, and data distributions, plus consistency against fixed and broad classes of alternatives. They demonstrate competitive empirical performance against the classic quadratic-time MMD and other scalable methods on simulated and real data (e.g., MNIST), while maintaining a favorable computational profile of $O(n^2)$. Extensions include a multi-kernel version (mmMMD) with a $\chi^2$ null and a generalized family $T_{n,\gamma}$ that interpolates between MMD and mMMD, with partial minimax optimality results. Overall, the work provides a practical, theoretically solid, resampling-free approach to kernel two-sample testing with scalable accuracy and broad applicability.

Abstract

The Maximum Mean Discrepancy (MMD) is a widely used multivariate distance metric for two-sample testing. The standard MMD test statistic has an intractable null distribution, typically requiring costly resampling or permutation approaches for calibration. In this work, we leverage a martingale interpretation of the estimated squared MMD to propose martingale MMD (mMMD), a quadratic-time statistic which has a limiting standard Gaussian distribution under the null. Moreover, we show that the test is consistent against any fixed alternative and that, for large sample sizes, mMMD offers substantial computational savings over the standard MMD test, with only a minor loss in power.
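
To make the construction concrete, here is a minimal Python sketch of a statistic in the spirit of mMMD. It is an illustration under stated assumptions, not the paper's exact definition (which is given in the paper's own equations and is not reproduced on this page). The assumptions are: equal sample sizes with observations paired as $(X_i, Y_i)$, a Gaussian kernel with a fixed bandwidth, increments formed by averaging the usual MMD kernel $h$ over past observations only, and self-normalization by the root of the sum of squared increments. The function names `gaussian_kernel` and `martingale_mmd_sketch` are hypothetical.

```python
import math
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def martingale_mmd_sketch(x, y, bandwidth=1.0):
    """Studentized sum of past-averaged increments (illustrative assumption,
    not necessarily the paper's exact normalization)."""
    n = x.shape[0]
    if y.shape[0] != n or n < 2:
        raise ValueError("this sketch assumes equal sample sizes with n >= 2")
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # h[i, j] = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    h = kxx + kyy - kxy - kxy.T
    # Strictly lower triangle: row i keeps only past indices j < i.
    lower = np.tril(h, k=-1)
    # Row-wise mean over the past; row 0 has no past and is dropped.
    xi = lower[1:].sum(axis=1) / np.arange(1, n)
    stat = xi.sum() / np.sqrt(np.sum(xi**2))
    # One-sided p-value from a standard normal reference distribution.
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))
    return stat, p_value
```

Under the null, each increment has conditional mean zero given the past, which is what makes a normal reference distribution plausible without permutations. The whole computation needs one pass over the $n \times n$ kernel matrices, so the cost is quadratic in $n$.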

Paper Structure

This paper contains 43 sections, 17 theorems, 163 equations, 7 figures.

Key Result

theorem 1

Take $\mathcal{X} = \mathbb{R}^d$ for some $d\geq 1$. Suppose the kernel $\mathsf{K}$ satisfies Assumption \ref{assumption:K}, and let $P\in \mathcal{M}_{\mathsf{K}}^{1/2}$. Moreover assume $\mathbb{E}\left[\bar{\mathsf{K}}\left(X_1,X_2\right)^4\right]<\infty$ for $X_1,X_2$ generated independently from $P

Figures (7)

  • Figure 1: The $m\mathrm{MMD}$ test. (a) Empirical distribution of the $m\mathrm{MMD}$ test statistic (see \ref{eq:def_eta_n} for a formal definition) under $\bm{H}_0$ for $d = 20, 200$, where $P = Q$ is the $d$-dimensional standard Gaussian distribution. (b) Power of the $m\mathrm{MMD}$ test compared with the quadratic-time MMD test (implemented with $200$ permutations) from \cite{gretton2012kernel} and the computationally efficient variants from \cite{gretton2012kernel}, \cite{zaremba2013b}, and \cite{shekhar2022permutation}. The main takeaway is that MMD is the most powerful test but also the most computationally expensive. The $x\mathrm{MMD}$ and $m\mathrm{MMD}$ tests are more powerful than the other variants, and of these two only $m\mathrm{MMD}$ avoids sample splitting. (c) Computational efficiency of the proposed test compared with the permutation-based quadratic-time MMD test. All tests use the Gaussian kernel with median bandwidth.
  • Figure 2: A visual illustration of the main difference in computing the quadratic-time MMD (MMD), our proposed mMMD, the block-MMD (BMMD) from \cite{zaremba2013b}, and the cross-MMD (xMMD) from \cite{shekhar2022permutation}. The highlighted regions indicate the sample pairs used in the computation. The figure shows that the quadratic-time MMD considers all off-diagonal pairwise kernel evaluations between the combined samples. BMMD partitions the data into blocks and averages the pairwise kernel evaluations within each block. xMMD splits the samples into two halves and evaluates kernels only across the splits. In contrast, our proposed mMMD computes pairwise kernel values by taking the lower triangular part of within-sample and between-sample blocks. However, we emphasize that, due to the symmetry of the kernel, the terms used by MMD and mMMD are equivalent. The principal distinction between the two lies in the normalization procedure: mMMD computes a row-wise mean followed by an average across rows (see \ref{eq:Tn_alt}), whereas MMD applies a global mean over all elements.
  • Figure 3: Empirical null distribution of the $m\mathrm{MMD}$ test statistic $\eta_n$ from equation \ref{eq:def_eta_n}. The left two figures use a Gaussian kernel with the median heuristic, while the right two figures use a Laplace kernel, also with the median heuristic. The underlying data distribution is $\mathrm{N}(\mathbf{0}_d, \mathbf{I}_d)$.
  • Figure 4: Empirical null distribution of the $m\mathrm{MMD}$ test statistic $\eta_n$ from equation \ref{eq:def_eta_n}. The left two figures use a Gaussian kernel with the median heuristic, while the right two figures use a Laplace kernel, also with the median heuristic. The underlying data distribution is $t_d(10)$.
  • Figure 5: Empirical power of the $m\mathrm{MMD}$ test compared to the original quadratic-time MMD, the linear-time LMMD, the block-based BMMD, and the cross-MMD (xMMD) tests. From left to right, the figures correspond to $(d, j, \varepsilon) = (10, 5, 0.3)$, $(50, 5, 0.4)$, and $(100, 5, 0.5)$.
  • ...and 2 more figures
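
Several captions above refer to a kernel bandwidth chosen by the median heuristic. As a rough illustration, the Python sketch below implements one common variant: the median of pairwise Euclidean distances over the pooled sample. The captions do not state which convention the paper uses (pooled versus per-sample, distance versus squared distance), so this choice and the function name `median_heuristic_bandwidth` are assumptions.

```python
import numpy as np


def median_heuristic_bandwidth(x, y):
    """Median pairwise Euclidean distance over the pooled sample
    (one common variant; the paper's exact convention is assumed, not confirmed)."""
    pooled = np.vstack([x, y])
    # All pairwise differences; uses O(n^2 * d) memory, fine for moderate n.
    diffs = pooled[:, None, :] - pooled[None, :, :]
    dists = np.sqrt(np.sum(diffs**2, axis=-1))
    # Keep distinct pairs only, so the zero diagonal does not pull the median down.
    iu = np.triu_indices(pooled.shape[0], k=1)
    return float(np.median(dists[iu]))
```

The returned value can be passed as the bandwidth of a Gaussian or Laplace kernel; for large samples, computing the median on a random subsample is a common way to keep the cost down.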

Theorems & Definitions (29)

  • remark 1
  • theorem 1
  • theorem 2
  • remark 2
  • remark 3
  • theorem 3
  • theorem 4
  • theorem 5
  • theorem 6
  • remark 4
  • ...and 19 more