Variance-Aware Estimation of Kernel Mean Embedding

Geoffrey Wolfer, Pierre Alquier

TL;DR

The paper develops variance-aware convergence bounds for kernel mean embeddings (KMEs), with error measured by the maximum mean discrepancy (MMD). It shows that, with probability at least $1 - \delta$, the empirical KME deviation $\|\widehat{\mu}_{\mathbb{P}} - \mu_{\mathbb{P}}\|_{\mathcal{H}_k}$ is bounded by a leading term $\sqrt{2 v_k(\mathbb{P}) \frac{\log(2/\delta)}{n}}$ that scales with the RKHS variance $v_k(\mathbb{P})$, plus lower-order terms, and that for translation-invariant kernels this bound can be made data-driven by replacing $v_k(\mathbb{P})$ with an empirical proxy $\widehat{v}_k$. The authors extend these results to dependent data ($\phi$- and $\beta$-mixing sequences), formulate empirical Bernstein bounds, and apply them to hypothesis testing (goodness-of-fit and two-sample tests) and to robust parametric estimation under Huber contamination, with explicit bounds in the Gaussian location setting and a link function that translates MMD error into parameter-space error. Overall, the work provides finite-sample, distribution-agnostic improvements to MMD-based inference: rates improve when the RKHS variance is small, and dependent data are handled through mixing assumptions. The results combine kernel methods with empirical Bernstein techniques to yield tighter confidence bounds and test procedures for KMEs.
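To make the quantity being bounded concrete, here is a minimal sketch of the RKHS distance between two empirical KMEs, computed from Gram matrices. It assumes a Gaussian kernel and the biased (V-statistic) plug-in estimate; the helper names `gaussian_gram` and `mmd_between_samples`, the bandwidth, and the toy samples are illustrative choices, not taken from the paper.

```python
import numpy as np


def gaussian_gram(x, y, bandwidth=1.0):
    """Gram matrix of k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) (assumed kernel)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_between_samples(x, y, bandwidth=1.0):
    """RKHS norm of the difference of the two empirical KMEs (biased estimate)."""
    kxx = gaussian_gram(x, x, bandwidth).mean()
    kyy = gaussian_gram(y, y, bandwidth).mean()
    kxy = gaussian_gram(x, y, bandwidth).mean()
    # ||mu_x - mu_y||^2 = <mu_x, mu_x> + <mu_y, mu_y> - 2 <mu_x, mu_y>
    return np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))           # sample from N(0, I_2)
y = rng.normal(loc=1.0, size=(200, 2))  # sample from N((1, 1), I_2)

same = mmd_between_samples(x, x)     # distance of a sample to itself
shifted = mmd_between_samples(x, y)  # distance between shifted samples
print(f"MMD(x, x) = {same:.4f}, MMD(x, y) = {shifted:.4f}")
```

The distance of a sample to itself vanishes, while the mean-shifted sample is clearly separated; the paper's bounds control how far such an empirical embedding can be from its population counterpart.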

Abstract

An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desideratum of a distribution-agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.

Paper Structure (46 sections, 23 theorems, 193 equations, 5 figures)

Key Result

Theorem 2.1

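Let $X_1, \dots, X_n \stackrel{iid}{\sim} \mathbb{P}$, let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel, and let $v_k(\mathbb{P})$ denote the RKHS variance of $\mathbb{P}$. With probability at least $1 - \delta$, it holds that $\|\widehat{\mu}_{\mathbb{P}} - \mu_{\mathbb{P}}\|_{\mathcal{H}_k} \leq \sqrt{2 v_k(\mathbb{P}) \frac{\log(2/\delta)}{n}}$ plus lower-order terms, where $\overline{k} = \sup_{x \in\mathcal{X}} k(x,x)$. In particular, with probability at least $1 - \delta$, it holds that […] with […].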
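The leading term of the bound can be evaluated from data along the following lines. This is a sketch under stated assumptions, not the paper's exact estimator: it takes $v_k(\mathbb{P}) = \mathbb{E}\,\|k(X,\cdot) - \mu_{\mathbb{P}}\|_{\mathcal{H}_k}^2$ as the definition of the RKHS variance, uses a Gaussian kernel (translation-invariant, with $\overline{k} = 1$), and estimates the variance by $\overline{k}$ minus the average off-diagonal Gram entry. The names `empirical_rkhs_variance` and `leading_term` are hypothetical, and the lower-order terms of the theorem are deliberately left out.

```python
import numpy as np


def empirical_rkhs_variance(x, bandwidth=1.0):
    """Plug-in proxy for v_k(P) with a Gaussian kernel (assumed, k_bar = 1).

    For a translation-invariant kernel, k(x, x) = k_bar for every x, so
    v_k(P) = k_bar - E[k(X, X')]; the expectation is replaced by the
    average of the off-diagonal Gram entries (a U-statistic).
    """
    n = x.shape[0]
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    gram = np.exp(-sq_dists / (2.0 * bandwidth**2))
    off_diag_mean = (gram.sum() - np.trace(gram)) / (n * (n - 1))
    k_bar = 1.0
    return max(k_bar - off_diag_mean, 0.0)


def leading_term(variance, n, delta):
    """Leading term sqrt(2 * v * log(2 / delta) / n); lower-order terms omitted."""
    return np.sqrt(2.0 * variance * np.log(2.0 / delta) / n)


rng = np.random.default_rng(1)
n, delta = 500, 0.05
tight = rng.normal(scale=0.05, size=(n, 2))  # concentrated: small RKHS variance
wide = rng.normal(scale=3.0, size=(n, 2))    # spread out: variance close to k_bar

v_tight = empirical_rkhs_variance(tight)
v_wide = empirical_rkhs_variance(wide)
worst_case = leading_term(1.0, n, delta)     # variance replaced by its cap k_bar

print(f"v_tight = {v_tight:.4f}, leading term = {leading_term(v_tight, n, delta):.4f}")
print(f"v_wide  = {v_wide:.4f}, leading term = {leading_term(v_wide, n, delta):.4f}")
print(f"worst case (v = k_bar): {worst_case:.4f}")
```

Since $v_k(\mathbb{P}) \leq \overline{k}$, the worst-case value is what a variance-free analysis would use for this term; the concentrated sample shows the kind of regime in which the variance-aware term is much smaller.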

Figures (5)

  • Figure 1: Comparison of the test based on the empirical Bernstein (EmpBer) bound, the test based on the McDiarmid (McDia) bound, and the test based on the Monte Carlo estimation of the quantile $q_{1-\alpha}$. Frequency of rejection of $\mathbf{H}_0:\mathbb{P}\in\{\mathcal{N}((1,1), I_2)\}$ as a function of $\sigma$ with $\mathbb{P}= \mathcal{N}(0,\sigma^2 I_2)$.
  • Figure 2: Comparison of the test based on the empirical Bernstein (EmpBer) bound and the test based on the McDiarmid (McDia) bound. Frequency of rejection of $\mathbf{H}_0:\mathbb{P}\in\{\mathcal{N}(\theta, I_2),\theta\in\mathbb{R}^2\}$ as a function of $\sigma$ with $\mathbb{P}= \mathcal{N}(0,\sigma^2 I_2)$.
  • Figure 3: Comparison of the bound in \ref{eq:bound-theta-huber-contamination} versus \cite{cherief2022finite} as a function of $\gamma$.
  • Figure 4: Comparison of \cite{cherief2022finite} and \ref{eq:bound-theta-huber-contamination} for the optimal hyper-parameter $\gamma$ as a function of $n$.
  • Figure 5: Comparison of the bound in \ref{eq:bound-theta-huber-contamination} versus \cite{cherief2022finite} as a function of $\gamma$, and versus the empirical MSE over $50$ experiments, with $n=500$, log-scale, $\xi=0$, $\delta=0.05$. Right panel: zoom on the MSE.

Theorems & Definitions (39)

  • Theorem 2.1: Variance-aware confidence interval
  • Remark 2.1
  • Remark 2.2
  • Remark 2.3: Sharpness of the constants
  • Example 2.1: Gaussian location model with known variance
  • Remark 2.4
  • Lemma 2.1
  • Definition 3.1: \cite{maurer2006concentration, boucheron2009concentration}
  • Lemma 3.1
  • Lemma 3.2: Unbiasedness
  • ...and 29 more