Table of Contents
Fetching ...

Near-Optimal Approximations for Bayesian Inference in Function Space

Veit Wild, James Wu, Dino Sejdinovic, Jeremias Knoblauch

TL;DR

This work introduces a scalable, nonparametric approach to Bayesian inference in function spaces by casting the posterior as the stationary distribution of an RKHS-valued Langevin diffusion. By projecting the infinite-dimensional diffusion onto the first $M$ Kosambi–Karhunen–Loève components and using a sufficiency-based pushforward, the method (Projected Langevin Sampling, PLS) achieves a computational cost of $O(M^3+JM^2)$ while retaining high-fidelity posterior approximations. The authors prove near-optimality of the projected method relative to the best nonparametric variational posterior, with stronger results in the Gaussian-likelihood case where PLS coincides with SVGP. The framework accommodates arbitrary likelihoods and yields non-Gaussian posteriors, enabling multimodal inference that SVGPs cannot capture. Numerically, PLS matches or surpasses SVGP on regression and classification benchmarks, while offering clear advantages in non-Gaussian and multimodal settings and enabling parallelized computation for large-scale problems.

Abstract

We propose a scalable inference algorithm for Bayes posteriors defined on a reproducing kernel Hilbert space (RKHS). Given a likelihood function and a Gaussian random element representing the prior, the corresponding Bayes posterior measure $Π_{\text{B}}$ can be obtained as the stationary distribution of an RKHS-valued Langevin diffusion. We approximate the infinite-dimensional Langevin diffusion via a projection onto the first $M$ components of the Kosambi-Karhunen-Loève expansion. Exploiting the thus obtained approximate posterior for these $M$ components, we perform inference for $Π_{\text{B}}$ by relying on the law of total probability and a sufficiency assumption. The resulting method scales as $O(M^3+JM^2)$, where $J$ is the number of samples produced from the posterior measure $Π_{\text{B}}$. Interestingly, the algorithm recovers the posterior arising from the sparse variational Gaussian process (SVGP) (see Titsias, 2009) as a special case, owed to the fact that the sufficiency assumption underlies both methods. However, whereas the SVGP is parametrically constrained to be a Gaussian process, our method is based on a non-parametric variational family $\mathcal{P}(\mathbb{R}^M)$ consisting of all probability measures on $\mathbb{R}^M$. As a result, our method is provably close to the optimal $M$-dimensional variational approximation of the Bayes posterior $Π_{\text{B}}$ in $\mathcal{P}(\mathbb{R}^M)$ for convex and Lipschitz continuous negative log likelihoods, and coincides with SVGP for the special case of a Gaussian error likelihood.

Near-Optimal Approximations for Bayesian Inference in Function Space

TL;DR

This work introduces a scalable, nonparametric approach to Bayesian inference in function spaces by casting the posterior as the stationary distribution of an RKHS-valued Langevin diffusion. By projecting the infinite-dimensional diffusion onto the first Kosambi–Karhunen–Loève components and using a sufficiency-based pushforward, the method (Projected Langevin Sampling, PLS) achieves a computational cost of while retaining high-fidelity posterior approximations. The authors prove near-optimality of the projected method relative to the best nonparametric variational posterior, with stronger results in the Gaussian-likelihood case where PLS coincides with SVGP. The framework accommodates arbitrary likelihoods and yields non-Gaussian posteriors, enabling multimodal inference that SVGPs cannot capture. Numerically, PLS matches or surpasses SVGP on regression and classification benchmarks, while offering clear advantages in non-Gaussian and multimodal settings and enabling parallelized computation for large-scale problems.

Abstract

We propose a scalable inference algorithm for Bayes posteriors defined on a reproducing kernel Hilbert space (RKHS). Given a likelihood function and a Gaussian random element representing the prior, the corresponding Bayes posterior measure can be obtained as the stationary distribution of an RKHS-valued Langevin diffusion. We approximate the infinite-dimensional Langevin diffusion via a projection onto the first components of the Kosambi-Karhunen-Loève expansion. Exploiting the thus obtained approximate posterior for these components, we perform inference for by relying on the law of total probability and a sufficiency assumption. The resulting method scales as , where is the number of samples produced from the posterior measure . Interestingly, the algorithm recovers the posterior arising from the sparse variational Gaussian process (SVGP) (see Titsias, 2009) as a special case, owed to the fact that the sufficiency assumption underlies both methods. However, whereas the SVGP is parametrically constrained to be a Gaussian process, our method is based on a non-parametric variational family consisting of all probability measures on . As a result, our method is provably close to the optimal -dimensional variational approximation of the Bayes posterior in for convex and Lipschitz continuous negative log likelihoods, and coincides with SVGP for the special case of a Gaussian error likelihood.

Paper Structure

This paper contains 62 sections, 16 theorems, 156 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

For the optimal variational approximation $\Pi^{*}_{M}$ of $\Pi_{\text{B}}$ over $\mathcal{Q}_M$ given by we have that $\Pi^{*}_{M} = {\Pi}_{\tau^*}$ with probability measure $\tau^*(u) \propto \exp\left( -V^*(u) \right)$, where for all $u \in \mathbb{R}^M$. Here, $\xi \sim \mathcal{N}(0,I_M)$, $\Lambda_M := \operatorname{diag}(\lambda_1,\hdots,\lambda_M)$, and for $u \in \mathbb{R}^M$, where $

Figures (6)

  • Figure 1: To represent a GRE $F$ on an infinite-dimensional space $H_k$, we orthogonally project it onto the first $M$ terms of its Kosambi-Karhunen-Loève representation via \ref{['eq:projection']}. Importantly, the projected random variable $\text{Proj}[F]$ can be exactly represented using its first $M$ coefficients $F^1, F^2, \dots F^M$, which are jointly distributed as a fully factorised $M$-dimensional zero-mean Gaussian with variances $\lambda_1^2, \lambda_2^2, \dots, \lambda_M^2$.
  • Figure 2: In an ideal world, we would directly evolve the SDE $(F(t)))_{t\geq 0}$ on $\mathcal{H}_k$. Since this is numerically infeasible however, we instead use the projection in \ref{['eq:projection']}. To evolve the resulting SDE, we rely on its representation via $M$ components $\widetilde{F}^{1:M}$, which we can evolve according to \ref{['eq:components-Langevin-2']}. In its exact form, this SDE depends on unknown quantities. We therefore have to estimate the unknown eigenvalue-eigenfunction pairs $\{{\lambda}_m, {e}_m \}_{m=1}^M$, resulting in our final target SDE $(\widehat{F}^{1:M}(t))_{t\geq 0}$ that we can approximate via \ref{['eq:approx-langevin-components']} by applying Euler-Maruyama discretisation and forward simulation.
  • Figure 3: Projected Langevin Sampling (PLS) implements an approximate sampling procedure for targeting the Bayes posterior $\Pi_{\text{B}}$. In particular, an $M$-dimensional Langevin SDE $(\widehat{F}^{1:M}(t))_{t\geq 0}$ is obtained by projection and estimation and approximated using Euler-Maruyama discretisation. Based on the approximation $\widehat{F}^{1:M}(T) \approx \widehat{\Pi}^{1:M}_{\infty}$ for large enough $T$, this allows us to approximate $\Pi_{\text{B}}$ as $\widehat{\Pi}_{\infty}$ via Matheron's Rule, which implements the integration in \ref{['eq:def-pi-hat']}.
  • Figure 4: Left: Our results show that while the increase in computation time for PLS with $J$ is roughlly linear, it is negligible relative to the dominant $\mathcal{O}(M^3)$ term. Middle: The computational complexity of PLS in $N$ is linear, but thanks to parallelisation this effect is barely noticable in practice. Right: The dominant term for computation is $\mathcal{O}(M^3)$ for both PLS and SVGP, but parallelisation and a much smaller constant factor obscured by the $\mathcal{O}$-notation make PLS substantively faster, even for large $M$.
  • Figure 5: Posterior inference for a Poisson regression with rate function $\lambda(x) = f(x)^2$. Left: Displayed are $J=100$ posterior sample draws from PLS. Due to the symmetry of the associated likelihood function around $f=0$, we observe symmetry around the x-axis. Right: Depicted is a histogram displaying the location of the posterior draws from PLS on the left at $x=-0.55$. The shape of the histogram reflects the observed bimodality, and is decidedly non-Gaussian.
  • ...and 1 more figures

Theorems & Definitions (19)

  • Theorem 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Definition 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Remark 9
  • Lemma 10
  • ...and 9 more