Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

Ikjun Choi, Ilmun Kim

TL;DR

This work analyzes random Fourier feature (RFF) approximations for kernel two-sample testing via MMD under permutation tests. It proves that with a fixed number of random features, RFF-MMD can be pointwise inconsistent, but consistency is recovered once the number of features grows with the sample size, enabling sub-quadratic-time tests to achieve minimax separation rates under Sobolev-smooth alternatives. The authors derive uniform consistency results: for $L_2$/Sobolev alternatives the minimax rate is $n^{-2s/(4s+d)}$ with $R \geq n^{4d/(4s+d)}$, and for the MMD metric the rate is $n^{-1/2}$ with $R \sim n$, with a linear-time regime possible for Gaussian subclasses. Numerical studies corroborate the theory, showing that RFF-MMD can match the quadratic-time MMD power with moderate $R$ and exhibit linear-time scaling, offering a practical path to scalable, powerful kernel two-sample tests. Overall, the paper delineates how to balance computational efficiency and statistical power in RFF-based kernel testing and provides guidance for kernel choices and feature budgets in practice.
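
A back-of-the-envelope calculation shows when the Sobolev feature budget $n^{4d/(4s+d)}$ quoted above yields a sub-quadratic test. It rests on an assumption not stated in this summary: that evaluating the RFF-MMD statistic with $R$ features costs on the order of $nR$ operations, ignoring the dimension and the number of permutations. Under that assumption, with $R$ taken at the smallest admissible order:

```latex
\[
  n \cdot R \;\asymp\; n \cdot n^{4d/(4s+d)} \;=\; n^{(4s+5d)/(4s+d)},
  \qquad
  \frac{4s+5d}{4s+d} < 2
  \iff 4s + 5d < 8s + 2d
  \iff s > \frac{3d}{4}.
\]
```

For instance, with $d=1$ and $s=2$ the budget is $R \asymp n^{4/9}$ and the assumed cost is about $n^{13/9}$, which is below the $n^2$ cost of the exact MMD statistic.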

Abstract

Recent years have seen a surge in methods for two-sample testing, among which the Maximum Mean Discrepancy (MMD) test has emerged as an effective tool for handling complex and high-dimensional data. Despite its success and widespread adoption, the primary limitation of the MMD test has been its quadratic-time complexity, which poses challenges for large-scale analysis. While various approaches have been proposed to expedite the procedure, it has been unclear whether it is possible to attain the same power guarantee as the MMD test at sub-quadratic time cost. To fill this gap, we revisit the approximated MMD test using random Fourier features, and investigate its computational-statistical trade-off. We start by revealing that the approximated MMD test is pointwise consistent in power only when the number of random features approaches infinity. We then consider the uniform power of the test and study the time-power trade-off under the minimax testing framework. Our result shows that, by carefully choosing the number of random features, it is possible to attain the same minimax separation rates as the MMD test within sub-quadratic time. We demonstrate this point under different distributional assumptions such as densities in a Sobolev ball. Our theoretical findings are corroborated by simulation studies.
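
The approximated test described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes a Gaussian kernel with a fixed bandwidth, uses the plain squared distance between feature means (a V-statistic form, which may differ from the estimator analyzed in the paper), and the names `rff_features` and `rff_mmd_permutation_test` and all default values are hypothetical.

```python
import numpy as np


def rff_features(X, omegas):
    """Map X of shape (n, d) to cos/sin random Fourier features of shape (n, 2R)."""
    proj = X @ omegas.T                      # (n, R) inner products omega_r^T x
    R = omegas.shape[0]
    # Scaling by 1/sqrt(R) makes z(x)^T z(y) an average of cos(omega_r^T (x - y)).
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(R)


def rff_mmd_permutation_test(X, Y, R=50, bandwidth=1.0, n_perms=200,
                             alpha=0.05, seed=0):
    """Permutation two-sample test based on an RFF approximation of MMD^2.

    Assumption: Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2)), whose
    spectral distribution is N(0, I / bandwidth^2).
    """
    rng = np.random.default_rng(seed)
    n1, d = X.shape
    omegas = rng.normal(scale=1.0 / bandwidth, size=(R, d))
    # Features are computed once and reused across all permutations.
    Z = rff_features(np.vstack([X, Y]), omegas)   # shape (n1 + n2, 2R)
    n = Z.shape[0]

    def statistic(idx):
        # Squared Euclidean distance between the two groups' feature means.
        diff = Z[idx[:n1]].mean(axis=0) - Z[idx[n1:]].mean(axis=0)
        return float(diff @ diff)

    observed = statistic(np.arange(n))
    permuted = np.array([statistic(rng.permutation(n)) for _ in range(n_perms)])
    # Permutation p-value with the usual +1 correction.
    p_value = float((1 + np.sum(permuted >= observed)) / (1 + n_perms))
    return observed, p_value, bool(p_value <= alpha)
```

Calling `rff_mmd_permutation_test(X, Y, R=100)` on two arrays of shape `(n1, d)` and `(n2, d)` returns the statistic, the permutation p-value, and the rejection decision at level `alpha`. Because each permutation only re-averages precomputed features, the cost grows with the product of sample size and `R` rather than with the square of the sample size, which is the trade-off the paper studies.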

Paper Structure (31 sections, 17 theorems, 266 equations, 2 figures, 1 table)

Key Result

Lemma 1

Let $R \in \mathbb N$ be a fixed number and let $\boldsymbol{\omega}_R=\{\omega_r\}^R_{r=1}$ be a sequence of real-valued i.i.d. random vectors from a probability distribution on $\mathbb R^d$ which is absolutely continuous with respect to the Lebesgue measure. For arbitrary $\epsilon \in (0,1)$, th

Figures (2)

  • Figure 1: Power experiments with two different settings: (i) univariate Gaussian distribution, (ii) high-dimensional Gaussian distribution. The sample sizes are set to ${n_1}={n_2}=1000$ for the first row of graphs. For the second row of graphs, parameters are set to $\mu=0.15$ in the first column, $\sigma=1.3$ in the second column, $d=1000$ in the third column, and $\sigma=1.03$ in the fourth column.
  • Figure 2: Power experiments with two different settings: (i) perturbed uniform distribution, (ii) MNIST. The sample sizes are set to ${n_1}={n_2}=1000$ for the first row of graphs. For the second row of graphs, parameters are set to $\alpha=0.6$ in the first column, $\alpha=0.45$ in the second column, and $\gamma=0.1$ in the third and last columns.

Theorems & Definitions (18)

  • Lemma 1 (Chwialkowski et al., 2015)
  • Proposition 2
  • Theorem 3
  • Corollary 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Proposition 8
  • Lemma 9 (Bochner, 1933)
  • Lemma 10 (Bogachev, 2007)
  • ...and 8 more