Variational Linearized Laplace Approximation for Bayesian Deep Learning

Luis A. Ortega; Simón Rodríguez Santana; Daniel Hernández-Lobato

TL;DR

The paper tackles the challenge of reliable uncertainty estimation for large deep networks, where traditional Laplace-based approaches incur prohibitive cost. It introduces Variational LLA (VaLLA), a decoupled sparse Gaussian Process in an RKHS that keeps the neural network’s MAP predictions intact while learning a scalable posterior over functions via inducing points. VaLLA leverages variational sparse GP theory, dual RKHS representations, and an alpha-divergence objective to enable mini-batch optimization, achieving sub-linear training in dataset size and competitive predictive distributions compared to ELLA and other LLA variants. Empirical results across synthetic data, large-scale regression, and image classification demonstrate VaLLA’s strong uncertainty quantification (NLL, CQM, OOD-AUC) and favorable computation, highlighting its potential for scalable Bayesian deep learning and robust decision-making under uncertainty.

Abstract

The Linearized Laplace Approximation (LLA) has been recently used to perform uncertainty estimation on the predictions of pre-trained deep neural networks (DNNs). However, its widespread application is hindered by significant computational costs, particularly in scenarios with a large number of training points or DNN parameters. Consequently, additional approximations of LLA, such as Kronecker-factored or diagonal approximate GGN matrices, are utilized, potentially compromising the model's performance. To address these challenges, we propose a new method for approximating LLA using a variational sparse Gaussian Process (GP). Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN. Furthermore, it allows for efficient stochastic optimization, which results in sub-linear training time in the size of the training dataset. Specifically, its training cost is independent of the number of training points. We compare our proposed method against accelerated LLA (ELLA), which relies on the Nyström approximation, as well as other LLA variants employing the sample-then-optimize principle. Experimental results, both on regression and classification datasets, show that our method outperforms these already existing efficient variants of LLA, both in terms of the quality of the predictive distribution and in terms of total computational time.
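To make the cost argument in the abstract concrete, the following is a minimal sketch of plain weight-space LLA with a Gaussian likelihood, not of VaLLA itself. The two-parameter model `f`, the parameter values standing in for a MAP estimate, and the names `noise_var` and `prior_var` are illustrative assumptions, not taken from the paper. The sketch shows the two properties the abstract relies on: the predictive mean is the network output, and the GGN-based covariance requires a pass over every training point.

```python
import math


def f(x, theta):
    """Toy 'network' with two parameters: f(x) = w2 * tanh(w1 * x)."""
    w1, w2 = theta
    return w2 * math.tanh(w1 * x)


def jacobian(x, theta):
    """Gradient of f with respect to (w1, w2), evaluated at theta."""
    w1, w2 = theta
    t = math.tanh(w1 * x)
    return (w2 * x * (1.0 - t * t), t)


def lla_posterior_cov(xs, theta_map, noise_var, prior_var):
    """Invert the 2x2 matrix (sum_n J_n^T J_n / noise_var + I / prior_var).

    The sum is the GGN matrix for a Gaussian likelihood; note that it
    touches every training input in xs.
    """
    a = d = 1.0 / prior_var
    b = 0.0
    for x in xs:
        j1, j2 = jacobian(x, theta_map)
        a += j1 * j1 / noise_var
        b += j1 * j2 / noise_var
        d += j2 * j2 / noise_var
    det = a * d - b * b
    return ((d / det, -b / det), (-b / det, a / det))


def lla_predict(x, theta_map, cov):
    """Predictive mean is the network output; variance is J cov J^T."""
    j1, j2 = jacobian(x, theta_map)
    var = (j1 * (cov[0][0] * j1 + cov[0][1] * j2)
           + j2 * (cov[1][0] * j1 + cov[1][1] * j2))
    return f(x, theta_map), var
```

Calling `lla_posterior_cov(xs, theta, noise_var, prior_var)` followed by `lla_predict(x_new, theta, cov)` returns the unchanged network prediction together with a variance that shrinks relative to the prior as data are added. For a real DNN the matrix being inverted has one row and column per parameter, which is the cost that the Kronecker-factored, diagonal, and sparse GP approximations mentioned above aim to avoid.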

Paper Structure (34 sections, 5 theorems, 82 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Background
Gaussian Process (GP) Interpretation of LLA
Dual formulation of Gaussian Processes in RKHS
Variational LLA (VaLLA)
Variational Sparse GPs
Using Decoupled SGP and LLA
MAP solution and Hilbert space
Hessian Approximation in VaLLA
Hyper-parameter Tuning and $\alpha$-divergences
Mini-batch Optimization
Prediction
Inducing Points
Limitations of VaLLA
Related Work
...and 19 more sections

Key Result

Theorem 1

Using a sparse GP approximation with $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\,q(\mathbf{u})$ is equivalent to restricting the mean and covariance functions of the dual representation in the RKHS to [...], where the functional $\Phi_{\mathbf{Z}}: \mathbb{R}^M \to \mathcal{H}$ defines a linear combination of basis functions as $\Phi_{\mathbf{Z}}(\bm{a}) = \sum_{m=1}^{M} a_m \phi_{\mathbf{z}_m}$ [...]
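The linear combination $\Phi_{\mathbf{Z}}(\bm{a})$ in Theorem 1 can be illustrated with a small sketch. It assumes that each basis function is a kernel section, $\phi_{\mathbf{z}} = k(\mathbf{z}, \cdot)$, so that evaluating $\Phi_{\mathbf{Z}}(\bm{a})$ at an input $x$ gives $\sum_m a_m k(z_m, x)$. The RBF kernel, the one-dimensional inputs, and the names `rbf_kernel` and `phi_Z` are illustrative choices, not the kernel or code used in the paper.

```python
import math


def rbf_kernel(x, z, lengthscale=1.0):
    """Stand-in kernel (an assumption; any positive-definite kernel works)."""
    return math.exp(-0.5 * ((x - z) / lengthscale) ** 2)


def phi_Z(a, Z, kernel=rbf_kernel):
    """Return the function Phi_Z(a) = sum_m a_m * kernel(z_m, .).

    a: list of M coefficients; Z: list of M inducing locations.
    """
    if len(a) != len(Z):
        raise ValueError("a and Z must both have length M")

    def g(x):
        return sum(a_m * kernel(x, z_m) for a_m, z_m in zip(a, Z))

    return g
```

For instance, `phi_Z([0.5, -1.0], [0.0, 2.0])` returns a callable that is fully determined by the $M = 2$ coefficients and inducing locations. Because the map from `a` to the returned function is linear, optimizing over functions of this form reduces to optimizing over an $M$-dimensional vector rather than over one value per training point.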

Figures (8)

Figure 1: Predictive distribution (mean in blue; the shaded region spans two standard deviations) on a toy 1D regression dataset, using a 2-hidden-layer MLP with $50$ units trained by back-propagation. The predictive distribution of VaLLA is on par with or better than those of other LLA approximations (last-layer and Kronecker-factored), MoE LLA with $200$ clusters, and ELLA. The noise and prior hyper-parameters are set by maximizing LLA's estimate of the log marginal likelihood; VaLLA and ELLA reuse the values found by LLA. VaLLA uses $20$ inducing points for the predictive variances. ELLA uses $20$ random locations and $20$ features.
Figure 2: (left) Results on regression datasets. (right) Illustration of CQM on Taxi. Results are averaged over 5 random seeds (standard deviations are always $<10^{-4}$ and are omitted). The best value is highlighted in purple and the second best in teal. $^*$ marks Last Layer LLA.
Figure 3: (left) MNIST experiments. Results are averaged over 5 random seeds (standard deviations are $<10^{-4}$ in all cases and are omitted). (right) Box plots of training times in seconds. ELLA considers 10 prior values chosen using a validation set. Sampled-LLA uses 8 EM steps and 32 samples. The best value is highlighted in purple and the second best in teal. $^*$ marks Last Layer LLA.
Figure 4: (left) Results on FMNIST. Results are averaged over 5 random seeds (standard deviations are $<10^{-4}$ and are omitted). The best value is highlighted in purple and the second best in teal. $^*$ marks Last Layer LLA. (right) ECE and NLL for rotated FMNIST.
Figure 5: Results on corrupted CIFAR10 with ResNet56. Sampled LLA uses $64$ samples and ELLA uses $M=2000$ and $K=20$.
...and 3 more figures

Theorems & Definitions (8)

Theorem 1 (Cheng and Boots, 2016)
Proposition 1
Proposition 2
proof
Proposition 1
proof
Proposition 2
proof
