Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences

Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, Bharath K Sriperumbudur

TL;DR

The paper surveys the connections between Bayesian Gaussian process (GP) methods and frequentist kernel methods based on reproducing kernel Hilbert spaces (RKHSs). It highlights exact correspondences: the GP posterior mean coincides with the kernel ridge regression (KRR) estimator, and, for noiseless observations, the GP posterior standard deviation equals the worst-case error over unit-norm RKHS functions. It analyzes when GP draws lie in RKHSs, using spectral representations, Driscoll’s zero-one law, and powers of RKHSs to show where the two hypothesis spaces agree and where they differ. It then links these ideas to convergence rates, integral transforms, and numerical methods, showing that regularization in KRR corresponds to additive noise in GP regression and that many kernel-based tools (MMD, HSIC, kernel quadrature, Bayesian quadrature) admit GP interpretations. Overall, the probabilistic and functional-analytic perspectives are compatible and mutually informative, so results and methods can be transferred between the two, with implications for theory and practice in statistical learning and numerical analysis.

Abstract

This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches.
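
The equivalence mentioned in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the Matérn kernel quoted in the Figure 1 caption, a KRR objective with the squared loss averaged over the $n$ data points, and the resulting correspondence $\sigma^2 = n\lambda$ between the GP noise variance and the KRR regularization constant. The data values and all function names are hypothetical.

```python
import numpy as np

def matern52(a, b):
    """Matern kernel from the Figure 1 caption, for 1-D inputs a and b."""
    r = np.abs(a[:, None] - b[None, :])
    return (1.0 + np.sqrt(5.0) * r + 5.0 / 3.0 * r**2) * np.exp(-np.sqrt(5.0) * r)

def gp_posterior_mean(x_train, y_train, x_test, sigma):
    """Posterior mean of GP regression with a zero prior mean and noise std sigma."""
    K = matern52(x_train, x_train)
    weights = np.linalg.solve(K + sigma**2 * np.eye(len(x_train)), y_train)
    return matern52(x_test, x_train) @ weights

def krr_predict(x_train, y_train, x_test, lam):
    """Minimizer of (1/n) * sum_i (f(x_i) - y_i)^2 + lam * ||f||^2 over the RKHS."""
    n = len(x_train)
    K = matern52(x_train, x_train)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return matern52(x_test, x_train) @ alpha

# Hypothetical toy data: n = 3 points, noise std 0.1 as in the Figure 1 caption.
x_train = np.array([-1.0, 0.5, 2.0])
y_train = np.array([0.3, -0.2, 0.8])
x_test = np.linspace(-3.0, 3.0, 7)
sigma = 0.1
lam = sigma**2 / len(x_train)  # assumed correspondence sigma^2 = n * lam

gp_mean = gp_posterior_mean(x_train, y_train, x_test, sigma)
krr_fit = krr_predict(x_train, y_train, x_test, lam)
print(np.allclose(gp_mean, krr_fit))
```

With a different normalization of the KRR loss (for example, a sum instead of an average), the matching condition between $\sigma^2$ and $\lambda$ changes accordingly.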

Paper Structure

This paper contains 64 sections, 34 theorems, 168 equations, 2 figures, 1 algorithm.

Key Result

Theorem 2.4

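Let $k$ be a shift-invariant kernel on $\mathcal{X} = \mathbb{R}^d$ such that $k(x,y) := \Phi(x-y)$ for $\Phi \in C(\mathbb{R}^d) \cap L_1(\mathbb{R}^d)$. Then the RKHS $\mathcal{H}_k$ of $k$ is given by

$$\mathcal{H}_k = \left\{ f \in L_2(\mathbb{R}^d) \cap C(\mathbb{R}^d) : \|f\|_{\mathcal{H}_k}^2 = \frac{1}{(2\pi)^{d/2}} \int \frac{|\mathcal{F}[f](\omega)|^2}{\mathcal{F}[\Phi](\omega)} \, d\omega < \infty \right\},$$

with the inner-product being

$$\langle f, g \rangle_{\mathcal{H}_k} = \frac{1}{(2\pi)^{d/2}} \int \frac{\mathcal{F}[f](\omega) \, \overline{\mathcal{F}[g](\omega)}}{\mathcal{F}[\Phi](\omega)} \, d\omega, \quad f, g \in \mathcal{H}_k,$$

where $\overline{\mathcal{F}[g](\omega)}$ denotes the complex conjugate of $\mathcal{F}[g](\omega)$.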

Figures (2)

  • Figure 1: Conceptual sketches of Gaussian process regression (left, center) and kernel ridge regression (right). Left: Prior measure $\mathfrak{f} \sim \mathcal{GP}(0,k)$ with vanishing prior mean and the Matérn-class kernel $k(x,x') = (1 + \sqrt{5}r + \tfrac{5}{3} r^2) \exp(-\sqrt{5}r)$ with $r := |x-x'|$. Prior mean function in thick black. Two marginal standard deviations in thin black. Marginal densities as gray shading. Five samples from the prior as green lines. Center: Given a dataset $(X,Y)$ of $n=3$ data points with i.i.d. zero-mean normal noise of standard deviation $\sigma=0.1$, the posterior measure is also a Gaussian process, with updated mean and covariance functions (all quantities as on the left). Right: Kernel ridge regression yields a point estimate (thick black) that is exactly equal to the Gaussian process posterior mean. In contrast to Gaussian process regression, an error estimate is usually not provided. This absence can be deliberate, as one may not be willing to impose the assumptions necessary to define such an estimate (e.g., additive Gaussian noise assumption). For comparison with the GP samples, the plot also shows some functions with the property that $f_X^{\intercal} k_{XX}^{-1} f_X = \|f_X\|_{k_{XX}^{-1}} ^2 = 1$ (but KRR does not assume the true function is of this form).
  • Figure 2: In-model error estimation. Plots similar to Fig. 1. Left: Hypothesis class/prior: The plot shows five sample paths from the GP prior in green and, for comparison, five functions with $f_X^{\intercal} k_{XX} ^{-1} f_X=1$ in red. In light gray in the background: Eigenfunction spectrum (regular grid over the continuous space of such functions), scaled by their eigenvalues. (See Sec. \ref{sec:Mercer} for eigen expansions of GP and RKHSs.) Right: When constrained on noise-less observations, both Gaussian process regression and kernel ridge regression afford the same in-model error estimate, plotted as two thin black lines (Proposition \ref{prop:wce_pvar}). In the GP context, this is the error bar of one marginal standard deviation. In the kernel context, it is the worst case error if the true function has unit RKHS norm. The red functions (which approximate such unit-norm RKHS elements) lie entirely inside this region, while GP samples (green) lie inside it for $\sim 68\%$ of the path (the expected value, the Gaussian probability mass within one standard-deviation). Another visible feature is that the GP samples are rougher than the unit-norm representers.
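
The in-model error estimate described in the Figure 2 caption can be illustrated with a small numerical check. This is a sketch under stated assumptions, not the paper's experiment: it uses the Matérn kernel from the Figure 1 caption, noiseless observations, hypothetical observation locations `x_obs` and a hypothetical query point `x_star`. It compares the GP posterior standard deviation at `x_star` with the RKHS norm of the error functional of kernel interpolation, and then checks that randomly drawn unit-norm RKHS functions (finite kernel expansions) never exceed that error.

```python
import numpy as np

def matern52(a, b):
    """Matern kernel from the Figure 1 caption, for 1-D inputs a and b."""
    r = np.abs(a[:, None] - b[None, :])
    return (1.0 + np.sqrt(5.0) * r + 5.0 / 3.0 * r**2) * np.exp(-np.sqrt(5.0) * r)

rng = np.random.default_rng(0)
x_obs = np.array([-1.0, 0.5, 2.0])  # hypothetical noiseless observation locations
x_star = 1.2                        # hypothetical query point

K = matern52(x_obs, x_obs)
k_star = matern52(np.array([x_star]), x_obs).ravel()
w = np.linalg.solve(K, k_star)      # interpolation weights: f(x*) ~ sum_i w_i f(x_i)

# GP view: posterior standard deviation at x_star given noiseless data.
prior_var = matern52(np.array([x_star]), np.array([x_star]))[0, 0]
post_std = np.sqrt(prior_var - k_star @ w)

# Kernel view: RKHS norm of the error representer k(., x*) - sum_i w_i k(., x_i).
centers_err = np.concatenate(([x_star], x_obs))
coeffs_err = np.concatenate(([1.0], -w))
wce = np.sqrt(coeffs_err @ matern52(centers_err, centers_err) @ coeffs_err)

def interpolation_error(centers, coeffs):
    """|f(x*) - sum_i w_i f(x_i)| for f = sum_j coeffs_j k(., centers_j), rescaled to unit RKHS norm."""
    norm = np.sqrt(coeffs @ matern52(centers, centers) @ coeffs)
    a = coeffs / norm
    f_star = matern52(np.array([x_star]), centers).ravel() @ a
    f_obs = matern52(x_obs, centers) @ a
    return abs(f_star - w @ f_obs)

# Random unit-norm functions stay within the bound; the normalized error
# representer itself attains it.
random_errors = [
    interpolation_error(rng.uniform(-3.0, 3.0, size=8), rng.normal(size=8))
    for _ in range(200)
]
max_random_error = max(random_errors)
attained_error = interpolation_error(centers_err, coeffs_err)
print(post_std, wce, max_random_error, attained_error)
```

The bound concerns functions of unit RKHS norm only. GP sample paths are not such functions, which is consistent with the caption's remark that they leave the one-standard-deviation band on part of the path.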

Theorems & Definitions (112)

  • Definition 2.1: Positive definite kernels
  • Remark 2.1
  • Example 2.1: Gaussian RBF/Square-Exponential Kernels
  • Example 2.2: Matérn kernels
  • Remark 2.2
  • Remark 2.3
  • Example 2.3: Polynomial kernels
  • Definition 2.2: Gaussian processes
  • Remark 2.4
  • ...and 102 more