Large sample analysis of the median heuristic

Damien Garreau, Wittawat Jitkrittum, Motonobu Kanagawa

TL;DR

The paper addresses the lack of theoretical understanding of the median heuristic for kernel bandwidth selection in RBF kernels, focusing on kernel two-sample tests via MMD. It develops a large-sample framework showing the empirical median of pairwise squared distances converges to a target mixture and is asymptotically normal under mild conditions, derived through a CLT for non-identically distributed U-statistics. The authors provide exact expressions for the target distribution, prove the CLT for the empirical CDF of pairwise distances, and establish the asymptotic normality of the median itself. Empirically, they compare the median heuristic against power-maximization criteria on Gaussian benchmarks, revealing the median performs comparably in mean-shift settings but may be suboptimal under variance-shift scenarios, with implications for bandwidth choice in practice. The work offers a principled justification for the median heuristic in certain regimes and highlights its limitations, guiding when to favor alternative bandwidth selection strategies in kernel-based tests and analyses.
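
As a minimal sketch of the quantity discussed above, the empirical median of pairwise squared distances can be computed as follows. The function names (`pairwise_sq_dists`, `median_heuristic_sq`, `gaussian_gram`) are hypothetical, and the kernel parameterization $\exp(-\lVert x-y\rVert^2/(2\nu))$ is an assumption: conventions differ on whether the bandwidth is the median itself, its square root, or a multiple of either.

```python
import numpy as np


def pairwise_sq_dists(points):
    """Return the (n, n) matrix of squared Euclidean distances."""
    points = np.asarray(points, dtype=float)
    sq_norms = np.sum(points ** 2, axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off


def median_heuristic_sq(points):
    """Empirical median of the squared distances over all pairs i < j."""
    sq = pairwise_sq_dists(points)
    rows, cols = np.triu_indices(sq.shape[0], k=1)
    return float(np.median(sq[rows, cols]))


def gaussian_gram(points, bandwidth_sq):
    """Gaussian kernel Gram matrix for a given squared bandwidth."""
    # ASSUMPTION: kernel written as exp(-||x - y||^2 / (2 * bandwidth_sq)).
    return np.exp(-pairwise_sq_dists(points) / (2.0 * bandwidth_sq))
```

In a two-sample setting, `points` would typically be the pooled sample, so the median is taken over within-sample and between-sample pairs alike.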

Abstract

In kernel methods, the median heuristic has been widely used as a way of setting the bandwidth of RBF kernels. While its empirical performance makes it a safe choice under many circumstances, there is little theoretical understanding of why this is the case. Our aim in this paper is to advance our understanding of the median heuristic by focusing on the setting of kernel two-sample testing. We collect new findings that may be of interest to both theoreticians and practitioners. On the theoretical side, we provide a convergence analysis that shows the asymptotic normality of the bandwidth chosen by the median heuristic in the setting of kernel two-sample testing. We also conduct systematic empirical investigations in simple settings, comparing the performance obtained with the bandwidth chosen by the median heuristic and with the bandwidth chosen by maximizing the test power.

Paper Structure

This paper contains 27 sections, 7 theorems, 80 equations, 6 figures.

Key Result

Lemma 3.1

Let $\mu_X$ (resp. $\mu_Y$) denote the expectation and $\Sigma_X$ (resp. $\Sigma_Y$) the covariance matrix of $X$ (resp. $Y$). Assume that there exists $\lambda > 75$ such that … Then, with probability at least $1-75/\lambda$, …

Figures (6)

  • Figure 1: Histogram of the squared distances $\lVert X_{n,i}-X_{n,j}\rVert^2$ with $n=400$ for Gaussian distributions in dimension $d=100$ and proportion $\alpha=0.25$. Left panel: change in the mean, $X\sim \mathcal{N}(0,\mathrm{I}_d)$ and $Y\sim\mathcal{N}(10^3\mathds{1},\mathrm{I}_d)$. Right panel: change in the variance, $X\sim \mathcal{N}(0,\mathrm{I}_d)$ and $Y\sim\mathcal{N}(0,2\mathrm{I}_d)$. The error bars correspond to the standard deviation over $10$ repetitions of the experiment.
  • Figure 2: Gaussian kernel bandwidth selected by different means. The left panel corresponds to the Mean scenario, where we vary the mean $\mu$ of $Q$. The right panel corresponds to the Var scenario, where we vary the variance $\sigma^2$ of $Q$. For each plot, the vertical axis depicts the values of selected bandwidths. The black curves are obtained by computing the theoretical median, the red curves by maximizing $R_u$ (the power criterion with quadratic-time MMD), and the red dotted curves by maximizing $R_{\ell}$ (the power criterion with linear-time MMD).
  • Figure 3: Approximate Bahadur slope (ABS) of the different tests in the Mean (left panel) and Var (right panel) scenarios. The black lines correspond to the $\text{MMD}_u$ test with the median heuristic, the red lines to the $\text{MMD}_u$ test with bandwidth chosen by maximizing the power criterion. Dotted lines correspond to the $\text{MMD}_{\ell}$ test. Error bars show the standard deviation over $5$ experiments; the randomness comes from the estimation of $\lambda_1$ from $10^3$ sample points.
  • Figure 4: Cumulative distribution function $F_T$ of $T$. Left panel: change in the mean. Right panel: change in the variance. Note that $F_T$ becomes very "flat" as $\mu$ increases in the left panel, leading to numerical problems when solving $F_T(t)=1/2$.
  • Figure 5: $R_{\ell}$ (left panel) and $R_u$ (right panel) as functions of the bandwidth in the change-in-mean scenario. Note that both $R_{\ell}$ and $R_u$ become very "flat" as $\mu$ goes to $0$, leading to numerical problems when maximizing with respect to $\nu$.
  • ...and 1 more figure
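
The change-in-variance setting of Figure 1 can be imitated with a short simulation. This is only a sketch under stated assumptions: the helper names are hypothetical, the seed is arbitrary, and the choice that the first $\alpha n$ points follow the law of $X$ (rather than $Y$) is an assumption not confirmed by the caption.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * (a @ b.T))
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off


def mean_intra(a):
    """Average squared distance over distinct pairs within one group."""
    sq = sq_dists(a, a)
    rows, cols = np.triu_indices(a.shape[0], k=1)
    return float(sq[rows, cols].mean())


rng = np.random.default_rng(0)
n, d, alpha = 400, 100, 0.25
n_x = int(alpha * n)  # ASSUMPTION: the first alpha * n points follow X

x = rng.standard_normal((n_x, d))                     # X ~ N(0, I_d)
y = np.sqrt(2.0) * rng.standard_normal((n - n_x, d))  # Y ~ N(0, 2 I_d)

mean_xx = mean_intra(x)                 # within the X group
mean_yy = mean_intra(y)                 # within the Y group
mean_xy = float(sq_dists(x, y).mean())  # between the two groups
print(mean_xx, mean_xy, mean_yy)        # population means: 2d, 3d, 4d
```

For independent $U$ and $V$, $\mathbb{E}\lVert U-V\rVert^2 = \operatorname{tr}\Sigma_U + \operatorname{tr}\Sigma_V + \lVert\mu_U-\mu_V\rVert^2$, so with $d=100$ the three groups of squared distances concentrate near $200$, $300$ and $400$, which is why a histogram of all pairwise squared distances shows separated modes in this setting.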

Theorems & Definitions (14)

  • Remark 2.1
  • Lemma 3.1: Gap between intra- and inter-distances
  • Remark 3.1
  • Proposition 3.1: CLT for non-identically distributed triangular array $U$-statistic
  • Proof (sketch)
  • Proposition 3.2: Asymptotic normality of $H_n$
  • Proof (sketch)
  • Proposition 4.1: Approximate Bahadur Slope computations
  • Lemma C.1: Controlling the variance of $A_n$, $B_n$, and $C_n$
  • Proof
  • ...and 4 more