Variance-Aware Estimation of Kernel Mean Embedding

Geoffrey Wolfer; Pierre Alquier

TL;DR

The paper develops variance-aware convergence bounds for kernel mean embeddings (KMEs), with error measured in the maximum mean discrepancy (MMD). It shows that the empirical KME deviation $\|\widehat{\mu}_{\mathbb{P}} - \mu_{\mathbb{P}}\|_{\mathcal{H}_k}$ is bounded by a leading term that scales with the RKHS variance $v_k(\mathbb{P})$, namely $\sqrt{2 v_k(\mathbb{P}) \frac{\log(2/\delta)}{n}}$, plus lower-order terms. For translation-invariant kernels, the bound can be made data-driven by replacing $v_k(\mathbb{P})$ with an empirical proxy $\widehat{v}_k$. The authors extend these results to dependent data ($\phi$- and $\beta$-mixing sequences), formulate empirical Bernstein bounds, and apply them to hypothesis testing (goodness-of-fit and two-sample tests) and to robust parametric estimation under Huber contamination, with explicit bounds in the Gaussian location setting and a translation to parameter-space error via a link function. Overall, the work provides finite-sample, distribution-agnostic improvements to MMD-based inference: faster rates when the variance is small, and a principled treatment of dependent data. The results combine kernel methods with empirical Bernstein techniques to yield tighter confidence bounds and test procedures for KMEs.
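To see why a variance-dependent leading term can improve on a worst-case one, it may help to expand the RKHS variance. The sketch below assumes the definition $v_k(\mathbb{P}) = \mathbb{E}_{X \sim \mathbb{P}} \|k(X,\cdot) - \mu_{\mathbb{P}}\|_{\mathcal{H}_k}^2$, which is not spelled out in this summary, writes $X'$ for an independent copy of $X$, and uses $\overline{k} = \sup_{x \in \mathcal{X}} k(x,x)$ together with the reproducing property.

$$
\begin{aligned}
v_k(\mathbb{P})
  &= \mathbb{E}\,\|k(X,\cdot)\|_{\mathcal{H}_k}^2
     - 2\,\mathbb{E}\,\langle k(X,\cdot), \mu_{\mathbb{P}} \rangle_{\mathcal{H}_k}
     + \|\mu_{\mathbb{P}}\|_{\mathcal{H}_k}^2 \\
  &= \mathbb{E}\,k(X,X) - \|\mu_{\mathbb{P}}\|_{\mathcal{H}_k}^2 \\
  &= \mathbb{E}\,k(X,X) - \mathbb{E}\,k(X,X')
  \;\le\; \overline{k} - \mathbb{E}\,k(X,X').
\end{aligned}
$$

Under this assumed definition, $v_k(\mathbb{P}) \le \overline{k}$ always holds. For a translation-invariant kernel, $k(x,x)$ is constant, so the last inequality is an equality, and $v_k(\mathbb{P})$ is small whenever two independent draws tend to be close at the kernel's scale.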

Abstract

An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desideratum of a distribution-agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.
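As a concrete illustration of the objects involved, the following minimal sketch computes the RKHS distance between two empirical KMEs, that is, the empirical MMD. It is not taken from the paper: the Gaussian kernel parametrisation, the bandwidth `gamma`, the sample sizes, and the helper names (`gaussian_kernel`, `mean_kernel`, `mmd_squared`, `sample_gaussian`) are all assumptions chosen for illustration, and the estimator shown is the standard biased V-statistic.

```python
import math
import random


def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-||x - y||^2 / gamma^2); the bandwidth gamma is an assumed choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / gamma ** 2)


def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs: the inner product of the two empirical KMEs."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Squared RKHS distance between two empirical KMEs (biased V-statistic form)."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))


def sample_gaussian(n, mean, sigma, rng):
    """Draw n points from N(mean, sigma^2 I), returned as tuples."""
    return [tuple(rng.gauss(m, sigma) for m in mean) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is reproducible
sample_a = sample_gaussian(200, (0.0, 0.0), 1.0, rng)
sample_b = sample_gaussian(200, (0.0, 0.0), 1.0, rng)  # same distribution as sample_a
sample_c = sample_gaussian(200, (1.0, 1.0), 1.0, rng)  # shifted mean

# Clip at zero to guard against tiny negative values from floating-point rounding.
mmd_same = math.sqrt(max(mmd_squared(sample_a, sample_b), 0.0))
mmd_shift = math.sqrt(max(mmd_squared(sample_a, sample_c), 0.0))
print(f"MMD, same distribution: {mmd_same:.3f}")
print(f"MMD, shifted mean:      {mmd_shift:.3f}")
```

The distance between two samples from the same distribution is nonzero only because of sampling error, which is the quantity the paper's concentration bounds control.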

Paper Structure (46 sections, 23 theorems, 193 equations, 5 figures)

Introduction
Notation and Background
Related Work
Main Contributions
Outline
Variance-Aware Convergence Rates
Gaussian Kernel
Convex Radial Square Basis Functions
Positive Definite Matrix on the Finite Space
Convergence Rates with Empirical Variance Proxy
Intuition in the Hypercube
Systematic Approach
Computability of the Empirical Variance Proxy
Convergence Rates for the Difference of Two Means
First Approach: Estimation by a U-Statistic
...and 31 more sections

Key Result

Theorem 2.1

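Let $X_1, \dots, X_n \stackrel{iid}{\sim} \mathbb{P}$, let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel, and let $v_k(\mathbb{P})$ denote the variance of $\mathbb{P}$ in the RKHS $\mathcal{H}_k$. With probability at least $1 - \delta$, the deviation $\|\widehat{\mu}_{\mathbb{P}} - \mu_{\mathbb{P}}\|_{\mathcal{H}_k}$ is at most $\sqrt{2 v_k(\mathbb{P}) \frac{\log(2/\delta)}{n}}$ plus a lower-order term, where $\overline{k} = \sup_{x \in\mathcal{X}} k(x,x)$. In particular, with probability at least $1 - \delta$, it holds that [...] with [...].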
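The effect of the variance on the leading term $\sqrt{2 v_k(\mathbb{P}) \log(2/\delta)/n}$ can be sketched numerically. The code below is an illustration under stated assumptions, not the paper's procedure: it assumes $v_k(\mathbb{P}) = \mathbb{E}\,k(X,X) - \mathbb{E}\,k(X,X')$ for independent $X, X'$, estimates it with a simple plug-in (the hypothetical helper `plugin_rkhs_variance`, which is not the paper's proxy $\widehat{v}_k$), and uses a Gaussian kernel with an assumed bandwidth. Plugging an estimate into the leading term does not by itself give a valid confidence bound; the paper's empirical-variance results address that separately.

```python
import math
import random


def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-||x - y||^2 / gamma^2); here sup_x k(x, x) = 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / gamma ** 2)


def plugin_rkhs_variance(xs, kernel):
    """Plug-in estimate of E k(X, X) - E k(X, X') (hypothetical helper, not the paper's proxy)."""
    n = len(xs)
    diag = sum(kernel(x, x) for x in xs) / n
    # Average over distinct pairs only, so the second term is a U-statistic.
    off_diag = sum(kernel(xs[i], xs[j])
                   for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return diag - off_diag


def leading_term(variance, n, delta):
    """Leading term sqrt(2 * v * log(2 / delta) / n) quoted in the summary."""
    return math.sqrt(2.0 * variance * math.log(2.0 / delta) / n)


def sample_gaussian(n, mean, sigma, rng):
    """Draw n points from N(mean, sigma^2 I), returned as tuples."""
    return [tuple(rng.gauss(m, sigma) for m in mean) for _ in range(n)]


rng = random.Random(1)  # fixed seed so the run is reproducible
n, delta = 300, 0.05
tight = sample_gaussian(n, (0.0, 0.0), 0.1, rng)  # concentrated: small RKHS variance
wide = sample_gaussian(n, (0.0, 0.0), 3.0, rng)   # spread out: variance close to 1

v_tight = plugin_rkhs_variance(tight, gaussian_kernel)
v_wide = plugin_rkhs_variance(wide, gaussian_kernel)
term_tight = leading_term(v_tight, n, delta)
term_wide = leading_term(v_wide, n, delta)
term_worst = leading_term(1.0, n, delta)  # variance replaced by its upper bound sup k(x, x) = 1

print(f"variance estimates: tight={v_tight:.3f}, wide={v_wide:.3f}")
print(f"leading terms: tight={term_tight:.4f}, wide={term_wide:.4f}, worst case={term_worst:.4f}")
```

When the sample is concentrated relative to the bandwidth, the leading term falls well below the value obtained by replacing the variance with its upper bound $\overline{k}$, which is the favorable regime the variance-aware bounds are designed to exploit.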

Figures (5)

Figure 1: Comparison of the test based on the empirical Bernstein (EmpBer) bound, the test based on the McDiarmid (McDia) bound, and the test based on Monte Carlo estimation of the quantile $q_{1-\alpha}$. Frequency of rejection of $\mathbf{H}_0:\mathbb{P}\in\{\mathcal{N}((1,1), I_2)\}$ as a function of $\sigma$, with $\mathbb{P}= \mathcal{N}(0,\sigma^2 I_2)$.
Figure 2: Comparison of the test based on the empirical Bernstein (EmpBer) bound and the test based on the McDiarmid (McDia) bound. Frequency of rejection of $\mathbf{H}_0:\mathbb{P}\in\{\mathcal{N}(\theta, I_2),\theta\in\mathbb{R}^2\}$ as a function of $\sigma$, with $\mathbb{P}= \mathcal{N}(0,\sigma^2 I_2)$.
Figure 3: Comparison of the bound in \ref{eq:bound-theta-huber-contamination} with the bound of [cherief2022finite], as a function of $\gamma$.
Figure 4: Comparison of the bound of [cherief2022finite] with the bound in \ref{eq:bound-theta-huber-contamination} for the optimal hyper-parameter $\gamma$, as a function of $n$.
Figure 5: Comparison of the bound in \ref{eq:bound-theta-huber-contamination} with the bound of [cherief2022finite] as a function of $\gamma$, and with the empirical MSE over $50$ experiments ($n=500$, log scale, $\xi=0$, $\delta=0.05$). Right panel: zoom on the MSE.

Theorems & Definitions (39)

Theorem 2.1: Variance-aware confidence interval
Remark 2.1
Remark 2.2
Remark 2.3: Sharpness of the constants
Example 2.1: Gaussian location model with known variance
Remark 2.4
Lemma 2.1
Definition 3.1 [maurer2006concentration, boucheron2009concentration]
Lemma 3.1
Lemma 3.2: Unbiasedness
...and 29 more
