Bayesian neural networks with interpretable priors from Mercer kernels

Alex Alberts, Ilias Bilionis

TL;DR

The paper addresses uncertainty quantification in neural networks by introducing Mercer priors, which place the BNN parameter distribution in the Mercer (spectral) representation of a target Gaussian-process kernel, so that outputs approximate GP draws $u_{\theta}\sim\mathcal{N}(0,S)$. This approach preserves GP interpretability while retaining the scalability of neural networks, employing stochastic gradient Langevin dynamics with unbiased estimators to draw samples from $p(\theta)$. The authors demonstrate the method through Brownian-motion and Brownian-bridge case studies and apply Mercer priors to GP regression with heteroscedastic noise, a periodic BNN, and a nonlinear PDE inverse problem, highlighting the method’s versatility. They analyze the influence of spectral truncation $K$ and network width on fidelity, discuss convergence in the infinite-width limit, and outline open theoretical questions around hyperparameters and rigorous convergence. Overall, Mercer priors offer a principled, scalable framework to inject GP-like priors into BNNs for uncertainty quantification and scientific inverse problems.
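The sampling step mentioned above can be illustrated with a generic stochastic gradient Langevin dynamics loop. This is a minimal sketch on a toy one-dimensional target, not the authors' estimator or prior: the potential, the `centers` array, the step size, and the function name `sgld_sample` are all assumptions made for illustration. The only point it shows is that an unbiased minibatch estimate of the gradient can drive the Langevin update.

```python
import numpy as np

def sgld_sample(centers, n_chains=2000, n_steps=2000, step=0.01, batch=10, seed=0):
    """Run independent SGLD chains targeting N(mean(centers), 1).

    Toy potential (an assumption): U(theta) = mean_i (theta - c_i)^2 / 2,
    whose gradient is theta - mean(centers). Averaging over a random
    minibatch of centers gives an unbiased estimate of that gradient.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_chains)
    for _ in range(n_steps):
        # Draw one minibatch of indices per chain.
        idx = rng.integers(0, centers.size, size=(n_chains, batch))
        grad_u = theta - centers[idx].mean(axis=1)  # unbiased estimate of dU/dtheta
        noise = rng.standard_normal(n_chains)
        # Langevin update: gradient drift plus injected Gaussian noise.
        theta = theta - 0.5 * step * grad_u + np.sqrt(step) * noise
    return theta

centers = np.linspace(-1.0, 3.0, 101)  # hypothetical toy data with mean 1.0
samples = sgld_sample(centers)         # final state of each chain
```

With a small step size, the final states of the chains should be distributed approximately as a unit-variance Gaussian centred at the mean of `centers`. A constant step size leaves a small discretization bias.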

Abstract

Quantifying the uncertainty in the output of a neural network is essential for deployment in scientific or engineering applications where decisions must be made under limited or noisy data. Bayesian neural networks (BNNs) provide a framework for this purpose by constructing a Bayesian posterior distribution over the network parameters. However, the prior, which is of key importance in any Bayesian setting, is rarely meaningful for BNNs. This is because the complexity of the input-to-output map of a BNN makes it difficult to understand how certain distributions enforce any interpretable constraint on the output space. Gaussian processes (GPs), on the other hand, are often preferred in uncertainty quantification tasks due to their interpretability. The drawback is that GPs are limited to small datasets without advanced techniques, which often rely on the covariance kernel having a specific structure. To address these challenges, we introduce a new class of priors for BNNs, called Mercer priors, such that the resulting BNN has samples which approximate those of a specified GP. The method works by defining a prior directly over the network parameters from the Mercer representation of the covariance kernel, and does not rely on the network having a specific structure. In doing so, we can exploit the scalability of BNNs in a meaningful Bayesian way.
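To make the Mercer representation concrete, the sketch below samples the target GP itself (not a BNN) from a truncated eigen-expansion. It uses Brownian motion on $[0,1]$, whose kernel $k(s,t) = \min(s,t)$ has the known eigenpairs $\lambda_n = 1/((n-\tfrac{1}{2})^2\pi^2)$ and $\phi_n(t) = \sqrt{2}\sin((n-\tfrac{1}{2})\pi t)$. The function name, grid, and sample counts are illustrative assumptions; the Mercer prior in the paper is instead placed on the network parameters.

```python
import numpy as np

def brownian_mercer_samples(t, n_terms=1000, n_samples=2000, seed=0):
    """Sample u(t) = sum_n sqrt(lambda_n) * xi_n * phi_n(t), xi_n ~ N(0, 1).

    Uses the first n_terms Mercer eigenpairs of k(s, t) = min(s, t) on [0, 1].
    Returns an array of shape (len(t), n_samples).
    """
    rng = np.random.default_rng(seed)
    n = np.arange(1, n_terms + 1)
    freq = (n - 0.5) * np.pi                         # equals 1 / sqrt(lambda_n)
    lam = 1.0 / freq**2                              # eigenvalues
    phi = np.sqrt(2.0) * np.sin(np.outer(t, freq))   # eigenfunctions on the grid
    xi = rng.standard_normal((n_terms, n_samples))   # independent coefficients
    return phi @ (np.sqrt(lam)[:, None] * xi)

t = np.linspace(0.0, 1.0, 51)
u = brownian_mercer_samples(t)

# Compare the empirical covariance of the samples with the exact kernel.
emp_cov = u @ u.T / u.shape[1]
true_cov = np.minimum.outer(t, t)
max_err = np.abs(emp_cov - true_cov).max()
```

The value `max_err` mixes two error sources: truncation at `n_terms` eigenpairs and Monte Carlo noise from the finite number of samples. With the settings above, the second is expected to dominate.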

Paper Structure

This paper contains 20 sections, 5 theorems, 48 equations, 15 figures, 1 table.

Key Result

Theorem 3.1

Let $u \sim \mathcal{GP}(m,k)$ be a measurable Gaussian process. Then, the sample paths $u \in L^2(\Omega)$ with probability $1$ if and only if $\int_{\Omega} m(x)^2\,dx < \infty$ and $\int_{\Omega} k(x,x)\,dx < \infty$. In this case, $u$ induces the Gaussian measure $\mathcal{N}(m,S)$ on $L^2(\Omega)$ with the covariance operator being $(Sv)(\cdot) = \int_{\Omega} k(\cdot,x)v(x)\,dx$, for $v \in L^2(\Omega)$.
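The covariance operator $S$ in the theorem can be approximated numerically by replacing the integral with a quadrature rule. The sketch below is an illustration under assumptions not taken from the paper: it takes $\Omega = [0,1]$, the Brownian-motion kernel $k(x,y) = \min(x,y)$, and a midpoint rule with a hypothetical grid size `n_grid`. It then compares the leading eigenvalues of the discretized operator with the analytic values $1/((n-\tfrac{1}{2})^2\pi^2)$.

```python
import numpy as np

n_grid = 500
h = 1.0 / n_grid
x = (np.arange(n_grid) + 0.5) * h   # midpoint quadrature nodes on [0, 1]
K = np.minimum.outer(x, x)          # kernel matrix k(x_i, x_j) = min(x_i, x_j)

def apply_S(v):
    """Approximate (S v)(x_i) = int_0^1 k(x_i, y) v(y) dy with the midpoint rule."""
    return K @ v * h

# Eigenvalues of the discretized operator, largest first.
eigvals = np.linalg.eigvalsh(K * h)[::-1]
analytic = 1.0 / ((np.arange(1, 6) - 0.5) ** 2 * np.pi ** 2)

# The first analytic eigenfunction should satisfy S phi_1 = lambda_1 phi_1.
phi1 = np.sqrt(2.0) * np.sin(0.5 * np.pi * x)
residual = np.abs(apply_S(phi1) - analytic[0] * phi1).max()

# Trace of S, i.e. the integral of k(x, x) over [0, 1], which equals 1/2 here.
trace = np.trace(K) * h
```

For this kernel the trace is finite, so the sample paths lie in $L^2(0,1)$ almost surely, and the eigenvalues sum to the same value of $1/2$.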

Figures (15)

  • Figure 1: Theoretical cost analysis between the methods.
  • Figure 2: Comparison between true Brownian motion and samples generated with the Mercer prior. The samples are generated with $K = 1,000$ eigenvalues in the prior and each sample is evaluated at $100,000$ points.
  • Figure 3: Comparison between the true covariance function of Brownian motion $k(s,t) = \min(s,t)$ and the empirical covariance generated with the Mercer prior. The approximation is generated with $K = 1,000$ eigenvalues with a width of $1,000$ neurons.
  • Figure 4: Statistical tests for BNNs which follow Brownian motion.
  • Figure 5: BNNs sampled from the Mercer prior for Brownian motion with $K=20$ eigenvalues and eigenfunctions. On the left in (a), we plot $5$ BNN samples, and on the right in (b), we show the evolution of the cumulative ratio of eigenvalues.
  • ...and 10 more figures
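The truncation level $K$ that appears in these captions controls how much of the kernel's total variance the prior retains. As a rough sketch, assuming that the "cumulative ratio of eigenvalues" means the running sum of eigenvalues divided by their total (an interpretation, not confirmed by the captions), the quantity can be computed for Brownian motion on $[0,1]$ as follows. The function name is hypothetical.

```python
import numpy as np

def cumulative_eigenvalue_ratio(n_terms):
    """Running sum of Brownian-motion eigenvalues divided by their total.

    Eigenvalues: lambda_n = 1 / ((n - 1/2)^2 pi^2) for k(s, t) = min(s, t) on [0, 1].
    Their infinite sum equals int_0^1 min(t, t) dt = 1/2.
    """
    n = np.arange(1, n_terms + 1)
    lam = 1.0 / ((n - 0.5) ** 2 * np.pi ** 2)
    total = 0.5  # exact trace of the covariance operator
    return np.cumsum(lam) / total

ratio = cumulative_eigenvalue_ratio(20)  # K = 20, as in the Figure 5 caption
```

The first entry equals $8/\pi^2 \approx 0.81$, so a single eigenfunction already carries most of the variance. The remaining terms decay like $1/n^2$, which is why the ratio approaches $1$ only slowly as $K$ grows.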

Theorems & Definitions (9)

  • Theorem 3.1: Theorem 2 of [rajput1972gaussian]
  • Remark 3.1
  • Theorem 3.2: Mercer's theorem [steinwart2008support]
  • Proposition 3.1
  • Proof
  • Remark 3.2
  • Definition 3.1: Reproducing kernel Hilbert space [kanagawa2018gaussian]
  • Lemma 3.1: Lemma 6.25 of [stuart2010inverse]
  • Lemma 3.2: Lemma 6.27 of [stuart2010inverse]