Kernel Two-Sample Tests for Manifold Data

Xiuyuan Cheng, Yao Xie

TL;DR

This paper develops non-asymptotic theory for kernel-based two-sample tests applied to data lying on or near low-dimensional manifolds, showing that the test power depends on the intrinsic dimension $d$, the Hölder smoothness $\beta$, and the squared $L^2$-divergence $\Delta_2$ between the densities. A key result is that with bandwidth $\gamma$ scaled as $n^{-1/(d+4\beta)}$, density departures are detectable when $\Delta_2 \gtrsim n^{-2\beta/(d+4\beta)}$, a rate that depends on $d$ rather than the ambient dimension, so the test avoids the curse of dimensionality on manifolds; the theory remains valid for non-PSD kernels as well. The authors extend the framework to manifolds with boundary and to data corrupted by additive noise, showing that the same finite-sample guarantees hold under reasonable conditions, using a near-boundary belt argument and Gaussian noise bounds. They validate the theory through numerical experiments on synthetic manifold data and the MNIST dataset, demonstrating that bandwidths smaller than the median distance can improve power when the intrinsic dimension is low and samples are plentiful, and that non-PSD kernels can still be effective. The work suggests a broader class of kernel-based tests for manifold data and informs bandwidth selection strategies that exploit intrinsic structure in high-dimensional settings.
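
To make the scaling concrete, the following sketch evaluates the bandwidth scale $n^{-1/(d+4\beta)}$ and the detection threshold $n^{-2\beta/(d+4\beta)}$ quoted above. It is only an illustration: all multiplicative constants are set to 1 (an assumption, since the paper's constants are not reproduced here), and the function name `manifold_rates` is hypothetical.

```python
def manifold_rates(n, d, beta):
    """Return (bandwidth scale, detection threshold scale) with constants set to 1.

    n    : sample size
    d    : intrinsic dimension of the manifold
    beta : Holder order of the densities, restricted to (0, 2] as in the summary
    """
    if n < 1 or d < 1:
        raise ValueError("n and d must be positive")
    if not (0 < beta <= 2):
        raise ValueError("beta must lie in (0, 2]")
    exponent = d + 4.0 * beta
    gamma_scale = n ** (-1.0 / exponent)          # gamma ~ n^{-1/(d+4*beta)}
    delta2_scale = n ** (-2.0 * beta / exponent)  # Delta_2 >~ n^{-2*beta/(d+4*beta)}
    return gamma_scale, delta2_scale


# The ambient dimension m never enters: only n, d, and beta matter.
gamma_1d, delta_1d = manifold_rates(n=10_000, d=1, beta=2.0)
gamma_5d, delta_5d = manifold_rates(n=10_000, d=5, beta=2.0)
```

With these scalings the threshold equals the bandwidth scale raised to the power $2\beta$, and a larger intrinsic dimension gives a larger (weaker) threshold at the same $n$.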

Abstract

We present a study of a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, assuming that high-dimensional observations are close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, when data densities $p$ and $q$ are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space and are Hölder with order $\beta$ (up to 2) on $\mathcal{M}$, we prove a guarantee of the test power for finite sample size $n$ that exceeds a threshold depending on $d$, $\beta$, and $\Delta_2$ the squared $L^2$-divergence between $p$ and $q$ on the manifold, and with a properly chosen kernel bandwidth $\gamma$. For small density departures, we show that with large $n$ they can be detected by the kernel test when $\Delta_2$ is greater than $n^{-2\beta/(d+4\beta)}$ up to a certain constant and $\gamma$ scales as $n^{-1/(d+4\beta)}$. The analysis extends to cases where the manifold has a boundary and the data samples contain high-dimensional additive noise. Our results indicate that the kernel two-sample test has no curse-of-dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.
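
As an illustration of the kind of test described in the abstract, here is a minimal sketch of an MMD-type statistic with a permutation test. Several choices are assumptions rather than the paper's definitions: the Gaussian kernel $\exp(-\|x-y\|^2/(2\gamma^2))$, the biased (V-statistic) normalization, the use of permutations to calibrate the test, and the toy data (angles on a unit circle placed in $\mathbb{R}^m$). All function names are hypothetical.

```python
import numpy as np


def gaussian_gram(A, B, gamma):
    """Gram matrix exp(-||a - b||^2 / (2 gamma^2)) between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    np.maximum(sq, 0.0, out=sq)  # clip tiny negative values caused by round-off
    return np.exp(-sq / (2.0 * gamma**2))


def mmd_v_statistic(X, Y, gamma):
    """Biased (V-statistic) estimate of the squared MMD between samples X and Y."""
    return (gaussian_gram(X, X, gamma).mean()
            + gaussian_gram(Y, Y, gamma).mean()
            - 2.0 * gaussian_gram(X, Y, gamma).mean())


def permutation_test(X, Y, gamma, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value) using pooled-sample shuffles."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    K = gaussian_gram(Z, Z, gamma)  # computed once, re-indexed for each permutation
    n_x = len(X)

    def stat(idx):
        ix, iy = idx[:n_x], idx[n_x:]
        return (K[np.ix_(ix, ix)].mean() + K[np.ix_(iy, iy)].mean()
                - 2.0 * K[np.ix_(ix, iy)].mean())

    observed = stat(np.arange(len(Z)))
    null = np.array([stat(rng.permutation(len(Z))) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value


def circle_samples(angles, m):
    """Place a unit circle (intrinsic dimension 1) in the first two coordinates of R^m."""
    out = np.zeros((len(angles), m))
    out[:, 0], out[:, 1] = np.cos(angles), np.sin(angles)
    return out


rng = np.random.default_rng(1)
X = circle_samples(rng.uniform(0.0, 2.0 * np.pi, size=200), m=50)  # uniform angles
Y = circle_samples(rng.vonmises(0.0, 1.0, size=200), m=50)         # von Mises angles
observed, p_value = permutation_test(X, Y, gamma=0.5)
```

Precomputing the pooled Gram matrix keeps each permutation cheap, since only index selections change between shuffles.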

Paper Structure

This paper contains 26 sections, 8 theorems, 127 equations, 5 figures, 1 table.

Key Result

Lemma 3.1

Suppose $\mathcal{M}$ satisfies Assumption assump:M, $h$ satisfies Assumption assump:h-C1, and $f$ is in $\mathcal{H}^\beta(\mathcal{M})$, $0 < \beta \le 2$, with Hölder constant $L_f$. Then there is a $\gamma_0 > 0$ that depends on $\mathcal{M}$ only, and a constant $C_1$ that depends on $\mathcal{M}$, $h$ … Specifically, $\gamma_0$ depends on the manifold reach and curvature, and $C_1 > 0$ depends on manifold …
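
The statement above is truncated, so the following is not a check of the lemma itself. It is a numerical illustration of the standard fact that the lemma's title ("Kernel integral on manifold") and the constant $m_0$ of Remark 3.1 point to: for small $\gamma$, the rescaled integral $\gamma^{-d}\int_{\mathcal{M}} h(\|x-y\|^2/\gamma^2) f(y)\,dV(y)$ is close to $m_0 f(x)$. Everything specific here is an assumption: the manifold is the unit circle in $\mathbb{R}^2$ ($d=1$), $h(r) = e^{-r/2}$, $m_0 = \int_{\mathbb{R}} h(u^2)\,du = \sqrt{2\pi}$, and $f$ is a smooth test density.

```python
import numpy as np


def kernel_integral_on_circle(f, x_angle, gamma, n_grid=200_000):
    """Approximate gamma^{-1} * integral over the unit circle of h(|x-y|^2/gamma^2) f(y)."""
    theta = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    # Squared Euclidean (chord) distance in R^2 between the points at x_angle and theta.
    sq_dist = 2.0 - 2.0 * np.cos(theta - x_angle)
    h_vals = np.exp(-sq_dist / (2.0 * gamma**2))  # assumed Gaussian profile h(r) = exp(-r/2)
    d_theta = 2.0 * np.pi / n_grid                # arc-length element on the unit circle
    return np.sum(h_vals * f(theta)) * d_theta / gamma


def f(theta):
    """A smooth density on the circle (integrates to 1 over one period)."""
    return (1.0 + 0.5 * np.cos(theta)) / (2.0 * np.pi)


m0 = np.sqrt(2.0 * np.pi)  # integral of exp(-u^2 / 2) over the real line
errors = {
    gamma: abs(kernel_integral_on_circle(f, 0.0, gamma) - m0 * f(0.0))
    for gamma in (0.2, 0.1, 0.05)
}
```

In this toy setting the approximation error shrinks as $\gamma$ decreases, which is the qualitative behaviour a bound of order $\gamma^\beta$ would describe.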

Figures (5)

  • Figure 1: (Left) A one-dimensional manifold with no boundary (a closed curve) embedded in $\mathbb{R}^3$, and an Euclidean ball centered at a point $x$ on the manifold. (Right) Illustration of a two-dimensional manifold with boundary, showing the near-boundary set $P_\gamma$ (gray belt), and two Euclidean balls centered at a point away from the boundary and another point on the boundary, respectively.
  • Figure 2: An example showing that increasing the ambient dimension $m$ affects neither the intrinsic dimensionality $d$ nor the intrinsic geometry of the manifold data. An image of the hand-written digit "8" is rotated by angles $z$ and rendered at different image sizes. The images for changing angle $z$ lie on a one-dimensional manifold in the ambient space and approach a certain continuous limit as the image resolution refines. The group element $z$ has two distributions, which induce two distributions of data images in the ambient space $\mathbb{R}^m$. When $z$ changes from 0 to $2\pi$, the curve is closed and the data manifold has no boundary. When $z$ changes from 0 to $\pi/2$, the curve has two endpoints and the data manifold has a boundary. The two-sample test results on this data are provided in the paper's experiments section.
  • Figure 3: An example of simulated manifold data on which the kernel test power does not drop as the ambient dimension $m$ increases while the intrinsic dimension $d$ remains constant. Gaussian kernel test statistics are computed on two datasets of rotated images with different distributions of rotation angles. Images are of sizes 10$\times$10, $\cdots$, 40$\times$40, and thus $m$ increases from 100 to 1600. The test is computed with five values of the kernel bandwidth (given in the paper's equation labeled eq:5-bandwidth) and with the bandwidth chosen by the median distance of the data. The test power is estimated using $n_{\rm run} = 500$. (Left) Results on clean images. (Middle) Results on images with additive Gaussian noise, where the noise level is chosen to be small and satisfies the condition in the paper's section on manifold data with noise. (Right) Example clean and noisy images (size 30$\times$30).
  • Figure 4: Kernel two-sample test to detect a local density departure of the MNIST image data distributions. (Top left) Datasets $X$ and $Y$ are visualized in 2D by tSNE, colored by 10-digit class labels. (Top middle and right) Kernel test statistic $\widehat{T}$ (red cross) plotted against the histogram of the test statistic under $H_0$ computed by bootstrap [arcones1992bootstrap] (blue bars; see the paper's Appendix A for details). The middle plot is for the Gaussian kernel test using the median distance as $\gamma$, and the right plot uses a smaller $\gamma$. (Bottom left) The local cohort density $p_{\rm cohort}$ is illustrated by red dots. (Bottom middle and right) The witness function (defined in the paper's equation labeled eq:def-witness) for the kernel using the median distance as $\gamma$ and a smaller bandwidth, respectively.
  • Figure 5: Two-sample test with non-Gaussian kernels that may not be PSD. (Left column) Three choices of the kernel function $h$, as described in the paper's section on non-PSD kernel experiments. (Right two columns) The same plots of test power on clean and noisy rotated-image data as in Figure 3, for each of the three kernel functions.
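
The median-distance bandwidth and the witness function mentioned in the Figure 4 caption can be sketched as follows. The paper's own definitions are not reproduced on this page, so the forms below are assumptions: the median heuristic takes the median pairwise Euclidean distance of the pooled data, and the witness function is taken to be the difference of the two empirical kernel averages, $\frac{1}{n_X}\sum_i K_\gamma(x, x_i) - \frac{1}{n_Y}\sum_j K_\gamma(x, y_j)$, with a Gaussian kernel. The toy data (a small, tight cohort added to one sample) and all function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(sq, 0.0)  # clip round-off negatives


def median_bandwidth(Z):
    """Median heuristic: median pairwise distance over distinct pairs of rows of Z."""
    sq = pairwise_sq_dists(Z, Z)
    upper = np.triu_indices(len(Z), k=1)  # each unordered pair once, no diagonal
    return float(np.sqrt(np.median(sq[upper])))


def witness(points, X, Y, gamma):
    """Empirical witness: mean kernel to X minus mean kernel to Y, at each query point."""
    k_x = np.exp(-pairwise_sq_dists(points, X) / (2.0 * gamma**2)).mean(axis=1)
    k_y = np.exp(-pairwise_sq_dists(points, Y) / (2.0 * gamma**2)).mean(axis=1)
    return k_x - k_y


rng = np.random.default_rng(0)
center = np.array([2.0, 0.0])                     # location of the local departure
X = rng.normal(size=(300, 2))                     # baseline sample
Y = np.vstack([rng.normal(size=(270, 2)),         # same baseline ...
               center + 0.1 * rng.normal(size=(30, 2))])  # ... plus a tight cohort

gamma_median = median_bandwidth(np.vstack([X, Y]))
gamma_small = 0.3
w_median = witness(center[None, :], X, Y, gamma_median)[0]
w_small = witness(center[None, :], X, Y, gamma_small)[0]
```

Under this sign convention the witness is negative where `Y` has excess mass. In this toy example the smaller bandwidth gives a much larger witness magnitude at the cohort location than the median bandwidth does, in line with the caption's contrast between the two bandwidth choices.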

Theorems & Definitions (23)

  • Example 1: Manifold data with increasing $m$
  • Lemma 3.1: Kernel integral on manifold
  • Lemma 3.2
  • Proposition 3.4: Control of $|\widehat{T} - T|$
  • Theorem 3.5: Power of kernel test
  • Remark 3.1: Constant $m_0$
  • Example 2: Constants for Gaussian $h$
  • Corollary 3.6: Rate-for-detection
  • Remark 3.2: Choice of bandwidth
  • Remark 3.3: Higher Hölder regularity
  • ...and 13 more