Variational Bayesian Last Layers

James Harrison; John Willes; Jasper Snoek

TL;DR

This work targets reliable uncertainty estimation in deep networks at minimal overhead. It proposes Variational Bayesian Last Layers (VBLLs), a sampling-free, last-layer Bayesian approach that yields a tractable, deterministic lower bound on the marginal likelihood. The authors derive ELBOs for regression, discriminative classification, and generative classification, and offer training variants that either optimize the last layer jointly with the features or operate post hoc on frozen features; the resulting models scale quadratically in the last-layer width. Empirical results across regression, image classification, sentiment analysis with LLM features, and contextual bandits show improved predictive accuracy, calibration, and out-of-distribution detection relative to strong baselines, while preserving compatibility with standard architectures. The work also provides practical guidance on hyperparameters, prediction strategies, and potential extensions, including combining VBLL with variational feature learning for collapsed VI.

Abstract

We introduce a deterministic variational formulation for training Bayesian last layer neural networks. This yields a sampling-free, single-pass model and loss that effectively improves uncertainty estimation. Our variational Bayesian last layer (VBLL) can be trained and evaluated with only quadratic complexity in last-layer width, and is thus (nearly) computationally free to add to standard architectures. We experimentally investigate VBLLs and show that they improve predictive accuracy, calibration, and out-of-distribution detection over baselines across both regression and classification. Finally, we investigate combining VBLL layers with variational Bayesian feature learning, yielding a lower-variance collapsed variational inference method for Bayesian neural networks.
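
To make "sampling-free" concrete for the regression case, the sketch below uses a standard Gaussian identity: for a Gaussian variational posterior $q(\bm{w}) = \mathcal{N}(\bar{\bm{w}}, S)$ over last-layer weights and a scalar Gaussian likelihood with noise variance $\sigma^2$, the expected log-likelihood is $\log \mathcal{N}(y \mid \bar{\bm{w}}^\top \bm{\phi}, \sigma^2) - \tfrac{1}{2} \sigma^{-2} \bm{\phi}^\top S \bm{\phi}$, which costs $O(N^2)$ per data point in the feature width $N$. This is not the authors' implementation, and it is an assumption that it matches the exact form of their regression bound. The function names, the scalar output, the fixed feature matrix `phi`, and the isotropic prior $\mathcal{N}(0, \texttt{prior\_var} \cdot I)$ are all hypothetical choices made for illustration.

```python
import numpy as np


def expected_gaussian_loglik(y, phi, w_bar, S, sigma2):
    """Closed-form E_q[log N(y_t | w^T phi_t, sigma2)] for q(w) = N(w_bar, S).

    y: (T,) targets, phi: (T, N) features, w_bar: (N,), S: (N, N).
    No samples of w are drawn; returns one value per data point.
    """
    resid = y - phi @ w_bar
    # phi_t^T S phi_t for every t; quadratic in the feature width N.
    quad = np.einsum("ti,ij,tj->t", phi, S, phi)
    loglik_at_mean = -0.5 * np.log(2.0 * np.pi * sigma2) - 0.5 * resid**2 / sigma2
    return loglik_at_mean - 0.5 * quad / sigma2


def gaussian_kl(w_bar, S, prior_var):
    """KL( N(w_bar, S) || N(0, prior_var * I) ); the prior is an assumption."""
    n = w_bar.shape[0]
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (
        np.trace(S) / prior_var
        + w_bar @ w_bar / prior_var
        - n
        + n * np.log(prior_var)
        - logdet_S
    )


def elbo_per_datapoint(y, phi, w_bar, S, sigma2, prior_var):
    """Deterministic lower bound on (1/T) * log p(y | phi) for fixed features."""
    T = y.shape[0]
    data_term = expected_gaussian_loglik(y, phi, w_bar, S, sigma2).mean()
    return data_term - gaussian_kl(w_bar, S, prior_var) / T
```

In this conjugate setting the bound is tight when `w_bar` and `S` equal the exact Bayesian linear regression posterior, which gives a convenient sanity check against the closed-form log marginal likelihood.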

Paper Structure (49 sections, 11 theorems, 84 equations, 7 figures, 9 tables, 1 algorithm)


Introduction
Bayesian Last Layer Neural Networks
Regression
Discriminative Classification
Generative Classification
Inference and Training in BLL Models
Sampling-Free Variational Inference for BLL Networks
Regression
Discriminative Classification
Generative Classification
Training VBLL Models
Prediction with VBLL Models
Related Work and Discussion
Experiments
Regression
...and 34 more sections

Key Result

Theorem 1

Let $q(\bm{\xi} \mid \bm{\eta}) = \mathcal{N}(\bar{\bm{w}}, S)$ denote the variational posterior for the BLL model defined in Section \ref{sec:reg}. Then, \eqref{eq:elbo} holds with

Figures (7)

Figure 1: Left: A variational BLL (VBLL) regression model with BBB features trained on 50 data points generated from a cubic function with additive Gaussian noise. The plot shows the 95% predictive credible region under the variational posterior for several sampled feature weights. Right: The (re-scaled) difference $p(\bm{x} \mid \bm{y} = 1) - p(\bm{x} \mid \bm{y} = 0)$ predicted by a generative VBLL model on the half moon dataset, showing good sensitivity to Euclidean distance and sensible embedding densities.
Figure 2: A performance comparison of G-VBLL, D-VBLL, and baseline MLP models on the IMDB Sentiment Classification Dataset. The models utilize text embeddings extracted from a pre-trained OPT-175B model. Results are presented across multiple training dataset scales, and the shaded regions represent $1\sigma$ error bounds.
Figure 3: Weight decay (left) and our KL/Inverse-Wishart regularizers (right) plotted versus $\exp(\bm{p}_k)$ (which corresponds to the diagonal element of the covariance matrix). Different curves show varying weight decay strength and varying $a$ term in \ref{eq:our_reg}, with $b=1$.
Figure 4: Sweeping over our modified hyperparameter representation. Left: sweeping over desired predictive variance $\hat{s}$, with $a=100$. Right: sweeping over regularization scale $a$ with fixed desired predictive variance $\hat{s} = 1$. Note that all functions asymptote at $\exp(2 \bm{p}_k) = 0$. In these figures, the curves have been vertically shifted to achieve a minimum at zero; this vertical shift does not impact regularization.
Figure 5: Sweeping over the $\Sigma$ location parameter for UCI datasets Energy (left) and Wine (right). The dotted colored lines correspond to $\Sigma^{-1}$ values over the course of training, and solid colored lines correspond to the Frobenius norm of $S$. The black dotted lines correspond to target $\Sigma^{-1}$ values. The scale hyperparameter was large in these experiments to illustrate the ability to effectively control noise covariance. Note that for very small $\Sigma^{-1}$, the impact of the predictive loss limits the degree to which realized noise covariance matches the goal value; this trade-off is controlled by scale parameters.
...and 2 more figures

Theorems & Definitions (19)

Theorem 1
Theorem 2
Theorem 3
Lemma 4
proof
Corollary 1
proof
Corollary 2
proof
Lemma 5
...and 9 more
