Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks

Javier Antoran

TL;DR

This thesis tackles the scalability gap between Bayesian uncertainty quantification and modern deep learning. It uses the linearised Laplace approximation to cast Bayesian inference in neural networks as inference in a conjugate Gaussian-linear model, and then solves that problem with stochastic methods. The key contributions are an SGD-based posterior sampling framework for Gaussian processes and a sample-based EM algorithm for hyperparameter learning in linearised neural networks. Together these enable uncertainty estimates for large models such as ResNet-50 trained on ImageNet, and for 3D tomographic reconstructions. The work also adapts the linearised Laplace approximation to modern practice, including normalisation layers and non-converged training, and demonstrates strong empirical performance on large-scale benchmarks and molecular prediction tasks.

Abstract

Large neural networks trained on large datasets have become the dominant paradigm in machine learning. These systems rely on maximum likelihood point estimates of their parameters, precluding them from expressing model uncertainty. This may result in overconfident predictions and it prevents the use of deep learning models for sequential decision making. This thesis develops scalable methods to equip neural networks with model uncertainty. In particular, we leverage the linearised Laplace approximation to equip pre-trained neural networks with the uncertainty estimates provided by their tangent linear models. This turns the problem of Bayesian inference in neural networks into one of Bayesian inference in conjugate Gaussian-linear models. Alas, the cost of this remains cubic in either the number of network parameters or in the number of observations times output dimensions. By assumption, neither is tractable. We address this intractability by using stochastic gradient descent (SGD) -- the workhorse algorithm of deep learning -- to perform posterior sampling in linear models and their convex duals: Gaussian processes. With this, we turn back to linearised neural networks, finding the linearised Laplace approximation to present a number of incompatibilities with modern deep learning practices -- namely, stochastic optimisation, early stopping and normalisation layers -- when used for hyperparameter learning. We resolve these and construct a sample-based EM algorithm for scalable hyperparameter learning with linearised neural networks. We apply the above methods to perform linearised neural network inference with ResNet-50 (25M parameters) trained on ImageNet (1.2M observations and 1000 output dimensions). Additionally, we apply our methods to estimate uncertainty for 3D tomographic reconstructions obtained with the deep image prior network.
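The idea of drawing posterior samples by optimisation rather than by matrix factorisation can be illustrated on a toy problem. The following is a minimal sketch, not the thesis's implementation: it uses a two-parameter affine model, full-batch gradient descent in place of SGD, and assumes an isotropic prior precision `a` and noise precision `b` (the values 6 and 2 are borrowed from the Figure 3 caption, and reading them as precisions is an assumption). The function names are hypothetical. Minimising a randomly perturbed regularised least-squares objective yields one exact sample from the Gaussian posterior, and the closed-form solve is included only as a check.

```python
import random


def features(x):
    # Affine basis expansion phi(x) = [1, x] (toy stand-in for a feature map).
    return [1.0, x]


def sample_then_optimise(xs, ys, a=6.0, b=2.0, steps=2000, lr=0.01, seed=0):
    """Draw one posterior weight sample by minimising a perturbed objective.

    Objective: 0.5 * b * sum_i (phi_i^T w - t_i)^2 + 0.5 * a * ||w - w0||^2,
    with w0 ~ N(0, I / a) and t_i = y_i + eps_i, eps_i ~ N(0, 1 / b).
    """
    rng = random.Random(seed)
    w0 = [rng.gauss(0.0, a ** -0.5) for _ in range(2)]
    targets = [y + rng.gauss(0.0, b ** -0.5) for y in ys]
    w = [0.0, 0.0]
    for _ in range(steps):
        # Gradient of the regulariser term.
        grad = [a * (w[j] - w0[j]) for j in range(2)]
        # Gradient of the data-fit term, accumulated over all observations.
        for x, t in zip(xs, targets):
            phi = features(x)
            resid = phi[0] * w[0] + phi[1] * w[1] - t
            for j in range(2):
                grad[j] += b * phi[j] * resid
        w = [w[j] - lr * grad[j] for j in range(2)]
    return w, w0, targets


def closed_form(xs, targets, w0, a=6.0, b=2.0):
    # Solve (b * Phi^T Phi + a * I) w = b * Phi^T t + a * w0 for the 2x2 case.
    h00 = a + b * len(xs)
    h01 = b * sum(xs)
    h11 = a + b * sum(x * x for x in xs)
    r0 = b * sum(targets) + a * w0[0]
    r1 = b * sum(x * t for x, t in zip(xs, targets)) + a * w0[1]
    det = h00 * h11 - h01 * h01
    return [(h11 * r0 - h01 * r1) / det, (h00 * r1 - h01 * r0) / det]


if __name__ == "__main__":
    xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    ys = [-1.8, -1.1, -0.4, 0.6, 0.9, 2.1]
    w, w0, targets = sample_then_optimise(xs, ys)
    print("optimised sample:", w)
    print("closed-form sample:", closed_form(xs, targets, w0))
```

The optimiser never forms or inverts the posterior precision matrix, which is why this style of sampling remains viable when the cubic-cost solve does not. In this toy case the step size must stay below 2 divided by the largest eigenvalue of the precision matrix for gradient descent to converge.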

Paper Structure (174 sections, 6 theorems, 204 equations, 68 figures, 13 tables)

Introduction
Thesis outline and contributions
Full list of publications
Bayesian reasoning with Gaussian linear models and Gaussian processes
The weight space view: Gaussian linear regression
Notation for probability distributions
Understanding our choice of model
Posterior inference: from loss functions to distributions
The function space view: Gaussian processes
Duality
From features to kernels
Bayesian reasoning about functions: Gaussian processes
Sampling from Gaussian processes & random features
Matrix square root sampling
Random feature prior sampling
...and 159 more sections

Key Result

Proposition 0

Let $\delta>0$. Let $B^{-1} = b^{-1}{I}$ for $b^{-1} > 0$. Let $\mu_{\text{SGD}}$ be the predictive mean function obtained by arithmetically-averaged SGD after $t$ steps, starting from an initial set of representer weights equal to zero, and using a sufficiently small learning rate of $0 < \beta <\f

Figures (68)

Figure 1: Each plot displays four prior function samples, drawn using \ref{eq:prior_function_sampling_weight_space}. The left plot uses an affine basis expansion \ref{eq:affine_expansion}, the middle one a 500 element random Fourier expansion with a Gaussian spectral measure and a lengthscale of $\psi=1$ \ref{eq:random_fourier_basis}, and the right plot uses a similar Fourier expansion but with a lengthscale of $\psi=0.3$.
Figure 2: Covariance matrices of the prior distribution over functions evaluated at 501 equally spaced points in the range $[-3, 3]$. The left plot uses an affine basis expansion \ref{eq:affine_expansion}, the middle one a 500 element random Fourier expansion with a Gaussian spectral measure and a lengthscale of $\psi=1$ \ref{eq:random_fourier_basis}, and the right plot uses a similar Fourier expansion but with a lengthscale of $\psi=0.3$.
Figure 3: The top left plot shows the $d=2$ dimensional posterior landscape of our affine model fit on an $n=6$ observation dataset with $B=2I$ and $A=6I$. The 1, 2 and 3 standard deviation prior and posterior contours are overlaid. We draw 2 samples from the weight space posterior, which we plot as function samples in the top right plot. The top right plot also displays the mean and 2 standard deviation contours of the posterior random function $f | Y$. The bottom left and bottom right plots display the same objects as the top right, but for the 500 element random Fourier basis with a Gaussian spectral measure. We set $A=0.4I$ for the Fourier models. The lengthscale on the left is $\psi=1$ and the right plot uses $\psi=0.3$.
Figure 4: Left: RBF kernel ($\psi=0.5$) evaluation functionals for each observation (black dots) in a toy 1d dataset. Right: the posterior mean function is a linear combination of evaluation functionals.
Figure 5: Convergence of the random Fourier feature basis (given in \ref{eq:random_fourier_basis}) to the RBF kernel's evaluation functional $k(0, \cdot)$, using the estimator in \ref{eq:unbiased_rff_kernel}, as the number of random features $d$ increases.
...and 63 more figures

Theorems & Definitions (12)

Proposition 0
proof
Definition 2: Normalised networks
Proposition 3
Proposition 4
Lemma 5
proof
Lemma 6
proof
proof: Proof of \ref{proposition:unique-posterior}
...and 2 more
