Bayes without Underfitting: Fully Correlated Deep Learning Posteriors via Alternating Projections

Marco Miani, Hrittik Roy, Søren Hauberg

TL;DR

This work proposes a matrix-free algorithm for projecting onto the null space of the generalized Gauss-Newton matrix, which scales linearly with the number of parameters and quadratically with the number of output dimensions, and an approximation that only scales linearly with parameters to make the method applicable to generative models.

Abstract

Bayesian deep learning all too often underfits so that the Bayesian prediction is less accurate than a simple point estimate. Uncertainty quantification then comes at the cost of accuracy. For linearized models, the null space of the generalized Gauss-Newton matrix corresponds to parameters that preserve the training predictions of the point estimate. We propose to build Bayesian approximations in this null space, thereby guaranteeing that the Bayesian predictive does not underfit. We suggest a matrix-free algorithm for projecting onto this null space, which scales linearly with the number of parameters and quadratically with the number of output dimensions. We further propose an approximation that only scales linearly with parameters to make the method applicable to generative models. An extensive empirical evaluation shows that the approach scales to large models, including vision transformers with 28 million parameters.
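The null-space idea can be illustrated on a toy problem. The sketch below is not the paper's matrix-free algorithm: it assumes a small explicit Jacobian `J` (a random stand-in), a generalized Gauss-Newton matrix of the form `J.T @ J` (as for a squared-error loss), and a dense projector, all of which are illustrative choices. It only shows that perturbations projected onto the null space leave the linearized training predictions unchanged.

```python
import numpy as np

def null_space_projector(J):
    """Dense projector onto ker(J) = ker(J^T J): P = I - J^T (J J^T)^{-1} J.

    Assumes J has full row rank so that J J^T is invertible.
    """
    JJt = J @ J.T
    return np.eye(J.shape[1]) - J.T @ np.linalg.solve(JJt, J)

rng = np.random.default_rng(0)
n_train, n_params = 5, 20                      # overparametrized toy setting
J = rng.standard_normal((n_train, n_params))   # hypothetical Jacobian at theta_map
theta_map = rng.standard_normal(n_params)      # hypothetical point estimate

P = null_space_projector(J)
eps = rng.standard_normal((1000, n_params))    # isotropic Gaussian perturbations
samples = theta_map + eps @ P                  # P is symmetric, so each row is projected

# Linearized predictions change by J (theta - theta_map); this should vanish.
pred_shift = (samples - theta_map) @ J.T
max_shift = np.abs(pred_shift).max()
print(f"largest change in training predictions: {max_shift:.2e}")
```

The dense projector costs memory quadratic in the number of parameters, which is exactly what a matrix-free method avoids; it is used here only because the toy problem is tiny.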

Paper Structure

This paper contains 48 sections, 10 theorems, 49 equations, 6 figures, 6 tables.

Key Result

Lemma 3.1

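The projected posterior (eq:approx_post) is supported on equal functions on the training data, i.e. $f_{\textnormal{lin}}^{\boldsymbol{\theta}_{\textnormal{map}}}(\boldsymbol{\theta}, \mathbf{x}) = f_{\textnormal{lin}}^{\boldsymbol{\theta}_{\textnormal{map}}}(\boldsymbol{\theta}_{\textnormal{map}}, \mathbf{x})$ for all $\boldsymbol{\theta} \sim q_{\textnormal{proj}}$ and $\forall \mathbf{x} \in \mathcal{D}$, which implies that $\textnormal{Var}_{\boldsymbol{\theta} \sim q_{\textnormal{proj}}} f_{\textnormal{lin}}^{\boldsymbol{\theta}_{\textnormal{map}}}(\boldsymbol{\theta}, \mathbf{x}) = 0$.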
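
A short sketch of why the variance vanishes, assuming the usual first-order linearization around $\boldsymbol{\theta}_{\textnormal{map}}$. The symbols $f$ (the network) and $\mathbf{J}_{\boldsymbol{\theta}_{\textnormal{map}}}(\mathbf{x})$ (its Jacobian with respect to the parameters) are notation introduced here for illustration and may differ from the paper's:

$$
f_{\textnormal{lin}}^{\boldsymbol{\theta}_{\textnormal{map}}}(\boldsymbol{\theta}, \mathbf{x})
= f(\boldsymbol{\theta}_{\textnormal{map}}, \mathbf{x})
+ \mathbf{J}_{\boldsymbol{\theta}_{\textnormal{map}}}(\mathbf{x})\,(\boldsymbol{\theta} - \boldsymbol{\theta}_{\textnormal{map}}).
$$

If $\boldsymbol{\theta} - \boldsymbol{\theta}_{\textnormal{map}}$ lies in the null space of the Jacobian stacked over the training set, then $\mathbf{J}_{\boldsymbol{\theta}_{\textnormal{map}}}(\mathbf{x})\,(\boldsymbol{\theta} - \boldsymbol{\theta}_{\textnormal{map}}) = \mathbf{0}$ for every $\mathbf{x} \in \mathcal{D}$, so

$$
f_{\textnormal{lin}}^{\boldsymbol{\theta}_{\textnormal{map}}}(\boldsymbol{\theta}, \mathbf{x})
= f(\boldsymbol{\theta}_{\textnormal{map}}, \mathbf{x})
\quad \forall \mathbf{x} \in \mathcal{D},
$$

which does not depend on $\boldsymbol{\theta}$ and therefore has zero variance under any distribution supported on that null space.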

Figures (6)

  • Figure 1: Key idea: In overparametrized linear models, the kernel (null space) contains all models that have identical predictions on the training data. We propose restricting approximate posteriors of deep neural networks to this kernel to avoid underfitting.
  • Figure 2: Visualization of Jacobian projections. Direct calculation of the projection (left) involves inverting a large $NO\!\times\! NO$ matrix. This is replaced by an infinite (in practice, truncated) series of cheap projections (right), which only require precomputing and storing inverses of several small $SO \!\times\! SO$ matrices.
  • Figure 3: The sparse approximations of the posterior fail to capture in-between uncertainty, whereas fully correlated posteriors can capture it.
  • Figure 4: Variational autoencoder reconstructions and corresponding uncertainty estimates. Left: mean reconstructions of MNIST and Fashion MNIST images sampled from the latent space. Right: pixel-wise uncertainty estimates generated by sampling decoder parameters from the Loss Kernel Posterior, highlighting key semantic features such as edges and contours. This demonstrates Projected Laplace's ability to capture uncertainty in high-dimensional generative models.
  • Figure 5: Model calibration and fit on in-distribution test data (left) and under distribution shift (middle, right), where we plot shift intensities against accuracy and expected calibration error (ECE), respectively.
  • ...and 1 more figure
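
The replacement of one large inverse by a sequence of cheap per-batch projections, as described in the Figure 2 caption, can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the batch Jacobians are small explicit random matrices (the paper works matrix-free), and the names `alternating_projections`, `batches`, and `n_sweeps` are hypothetical. It cyclically projects a vector onto the kernel of each batch Jacobian, storing only the small per-batch inverses, and compares the result with the direct projection that inverts the full Gram matrix.

```python
import numpy as np

def alternating_projections(v, batch_jacobians, n_sweeps):
    """Cyclically project v onto ker(J_b) for each batch Jacobian J_b.

    Only the small (S x S) inverses of J_b J_b^T are precomputed and stored.
    """
    gram_invs = [np.linalg.inv(J_b @ J_b.T) for J_b in batch_jacobians]
    for _ in range(n_sweeps):                      # truncated series of projections
        for J_b, G_inv in zip(batch_jacobians, gram_invs):
            v = v - J_b.T @ (G_inv @ (J_b @ v))    # v <- (I - J_b^T G_inv J_b) v
    return v

rng = np.random.default_rng(0)
n_batches, batch_size, n_params = 4, 3, 40
batches = [rng.standard_normal((batch_size, n_params)) for _ in range(n_batches)]
J_full = np.vstack(batches)                        # full Jacobian, (N x P) with N = 12

v = rng.standard_normal(n_params)

# Direct projection onto ker(J_full): requires solving with the large (N x N) matrix.
direct = v - J_full.T @ np.linalg.solve(J_full @ J_full.T, J_full @ v)

# Alternating projections: only (S x S) systems, repeated over several sweeps.
approx = alternating_projections(v, batches, n_sweeps=200)
error = np.linalg.norm(approx - direct)
print(f"distance between direct and alternating projection: {error:.2e}")
```

The agreement relies on the classical result that cyclic projections onto linear subspaces converge to the projection onto their intersection. How many sweeps are needed depends on the angles between the batch kernels, so the value of `n_sweeps` used here is specific to this toy example.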

Theorems & Definitions (17)

  • Lemma 3.1
  • Lemma 3.1
  • Theorem 3.2
  • Lemma 3.3
  • Lemma 3.4
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.2
  • proof
  • Theorem A.1
  • ...and 7 more