Student-t processes as infinite-width limits of posterior Bayesian neural networks

Francesco Caporali, Stefano Favaro, Dario Trevisan

TL;DR

The paper proves that, if the parameters of a BNN follow a Gaussian prior distribution, and the variance of both the last hidden layer and the Gaussian likelihood function follows an Inverse-Gamma prior distribution, then the resulting posterior BNN converges to a Student-t process in the infinite-width limit.

Abstract

The asymptotic properties of Bayesian Neural Networks (BNNs) have been extensively studied, particularly regarding their approximations by Gaussian processes in the infinite-width limit. We extend these results by showing that posterior BNNs can be approximated by Student-t processes, which offer greater flexibility in modeling uncertainty. Specifically, we show that, if the parameters of a BNN follow a Gaussian prior distribution, and the variance of both the last hidden layer and the Gaussian likelihood function follows an Inverse-Gamma prior distribution, then the resulting posterior BNN converges to a Student-t process in the infinite-width limit. Our proof leverages the Wasserstein metric to establish control over the convergence rate of the Student-t process approximation.
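The role of the Inverse-Gamma prior can be illustrated with the standard scalar fact behind it: a zero-mean Gaussian whose variance is drawn from an Inverse-Gamma$(a, b)$ distribution has a Student-t marginal with $2a$ degrees of freedom and scale $\sqrt{b/a}$. The sketch below checks this numerically. It is a one-dimensional illustration only, not the paper's construction; the function names are hypothetical, and the values $(a, b) = (3, 2)$ are borrowed from the Figure 2 caption.

```python
import numpy as np


def sample_scale_mixture(a, b, size, rng):
    """Draw x with sigma^2 ~ Inverse-Gamma(a, b) and x | sigma^2 ~ N(0, sigma^2)."""
    # If g ~ Gamma(shape=a, rate=b), then 1/g ~ Inverse-Gamma(a, b).
    # NumPy parameterizes the Gamma distribution by scale = 1/rate.
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=size)
    return np.sqrt(sigma2) * rng.standard_normal(size)


def sample_student_t(a, b, size, rng):
    """Draw from the Student-t with 2a degrees of freedom and scale sqrt(b/a)."""
    return np.sqrt(b / a) * rng.standard_t(df=2.0 * a, size=size)


rng = np.random.default_rng(0)
a, b = 3.0, 2.0  # hyperparameters taken from the Figure 2 caption
mixture = sample_scale_mixture(a, b, 200_000, rng)
student = sample_student_t(a, b, 200_000, rng)

# Compare a grid of empirical quantiles of the two samples.
levels = np.linspace(0.05, 0.95, 10)
quantile_gap = float(
    np.max(np.abs(np.quantile(mixture, levels) - np.quantile(student, levels)))
)
print("largest quantile gap:", quantile_gap)
print("sample variance:", mixture.var(), "theory:", b / (a - 1.0))
```

Both samples should agree up to Monte Carlo error, and the sample variance should be close to $b / (a - 1)$, the mean of the Inverse-Gamma prior.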

Paper Structure

This paper contains 27 sections, 14 theorems, 161 equations, 4 figures, 1 algorithm.

Key Result

Theorem 2.1

Given random variables $(\bm{x}_n)_{n = 1}^{\infty}$ and $\bm{x}$, $\lim_{n \to \infty} \mathcal{W}_{p}\left(\bm{x}_n, \bm{x}\right) = 0$ if and only if $\bm{x}_n \xrightarrow{law} \bm{x}$ and $\lim_{n \to \infty} \mathbb{E}\left[\left\lVert\bm{x}_n\right\rVert^p\right] = \mathbb{E}\left[\left\lVert\bm{x}\right\rVert^p\right]$.
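For intuition about the metric in Theorem 2.1, the $p$-Wasserstein distance between two one-dimensional empirical measures with the same number of atoms reduces to matching sorted samples. The sketch below uses that fact on a toy sequence $x_n = (1 + 1/n)\,x + 1/n$, which converges to $x$ both in law and in $p$-th moment. The helper name `empirical_wasserstein_1d` and the toy sequence are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def empirical_wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two equal-size 1D empirical measures.

    In one dimension the optimal coupling pairs the i-th smallest point of x
    with the i-th smallest point of y.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


rng = np.random.default_rng(0)
base = rng.standard_normal(10_000)

# Toy sequence x_n = (1 + 1/n) * x + 1/n, which approaches x as n grows.
distances = []
for n in (1, 10, 100):
    x_n = (1.0 + 1.0 / n) * base + 1.0 / n
    distances.append(empirical_wasserstein_1d(x_n, base, p=2))
    print(n, distances[-1])
```

The printed distances should shrink roughly like $1/n$, in line with the equivalence stated in the theorem.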

Figures (4)

  • Figure 1: Dependencies among the results used in the proof of \ref{thm:studposterior}.
  • Figure 2: Sequence of posterior BNNs, $(f_{\bm{\theta}_n} \, | \, \mathcal{D})_n$ (in gray), converging to the corresponding posterior Student-$t$ process, $G \, | \, \mathcal{D}$ (in green), in the infinite-width limit. Given the training set $\mathcal{D}$ (in red), we sampled $100$ values from both $G \, | \, \mathcal{D}$ and $f_{\bm{\theta}_n} \, | \, \mathcal{D}$ for each width $n \in \{2^0, \dots, 2^7\}$, following \ref{rem:poststudtprocess} and \ref{alg:samplingpostbnn}, respectively. The networks have $2$ hidden layers, erf activations, and parameter variances set to $5$. The hyperparameters $(a, b)$ are set to $(3, 2)$.
  • Figure 3: Sequence of posterior BNNs, $(f_{\bm{\theta}_n} \, | \, \mathcal{D})_n$ (in gray), converging to the corresponding posterior Gaussian process, $G \, | \, \mathcal{D}$ (in green), in the infinite-width limit. Given the training set $\mathcal{D}$ (in red), we sampled $100$ values from both $G \, | \, \mathcal{D}$ and $f_{\bm{\theta}_n} \, | \, \mathcal{D}$ for each width $n \in \{2^0, \dots, 2^7\}$. Sampling followed gp2006 for $G \, | \, \mathcal{D}$ and used the built-in NUTS algorithm in Pyro for $f_{\bm{\theta}_n} \, | \, \mathcal{D}$. The networks have $2$ hidden layers, erf activations, parameter variances set to $2$, and likelihood variance set to $0.1$.
  • Figure 4: Posterior Student-$t$ process (on the right) and posterior Gaussian process (on the left). We followed the same strategy and used the same parameters as for Figures 2 and 3.
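
The erf activations mentioned in the captions above are convenient because the expected product of two erf units under Gaussian weights has a known closed form, the arcsine kernel of Williams (1997). The sketch below compares that formula with a Monte Carlo average over a wide hidden layer. The parameterization is an assumption made for illustration (weights $\mathcal{N}(0, \sigma_w^2 / d)$, biases $\mathcal{N}(0, \sigma_b^2)$, both variances set to $5$ as in Figure 2), and the function names are hypothetical; the paper's exact scaling may differ.

```python
import math

import numpy as np

# Elementwise erf built from the standard library, to avoid a SciPy dependency.
_erf = np.vectorize(math.erf)


def erf_kernel(k_xy, k_xx, k_yy):
    """Closed form of E[erf(u) erf(v)] for zero-mean Gaussian (u, v)
    with Var(u) = k_xx, Var(v) = k_yy and Cov(u, v) = k_xy."""
    ratio = 2.0 * k_xy / math.sqrt((1.0 + 2.0 * k_xx) * (1.0 + 2.0 * k_yy))
    return (2.0 / math.pi) * math.asin(ratio)


def monte_carlo_erf_kernel(x, y, var_w, var_b, width, rng):
    """Average of erf(w.x + b) * erf(w.y + b) over `width` random hidden units."""
    d = x.shape[0]
    w = rng.normal(0.0, math.sqrt(var_w / d), size=(width, d))
    b = rng.normal(0.0, math.sqrt(var_b), size=width)
    return float(np.mean(_erf(w @ x + b) * _erf(w @ y + b)))


def preactivation_cov(u, v, var_w, var_b):
    """Covariance of the pre-activations w.u + b and w.v + b (assumed scaling)."""
    return var_b + var_w * float(u @ v) / u.shape[0]


rng = np.random.default_rng(0)
x = np.array([0.3, -1.0])
y = np.array([1.2, 0.5])
var_w = var_b = 5.0  # "parameter variances set to 5", as in the Figure 2 caption

exact = erf_kernel(
    preactivation_cov(x, y, var_w, var_b),
    preactivation_cov(x, x, var_w, var_b),
    preactivation_cov(y, y, var_w, var_b),
)
approx = monte_carlo_erf_kernel(x, y, var_w, var_b, width=200_000, rng=rng)
print("closed form:", exact, "Monte Carlo:", approx)
```

As the width grows, the Monte Carlo average should approach the closed form, which is the one-layer step of the kernel recursion behind the Gaussian process limit.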

Theorems & Definitions (38)

  • Theorem 2.1: Theorem 6.9 of otvillani2008
  • Definition 2.2: Fully connected feed-forward NN
  • Definition 2.3: BNN
  • Remark 2.4
  • Remark 2.5
  • Definition 2.6
  • Remark 2.7
  • Theorem 2.8: basteri2022trevisan2023
  • Remark 2.9
  • Remark 3.1
  • ...and 28 more