Reparameterization invariance in approximate Bayesian inference

Hrittik Roy, Marco Miani, Carl Henrik Ek, Philipp Hennig, Marvin Pförtner, Lukas Tatzel, Søren Hauberg

TL;DR

Linearized predictives are observed to alleviate the common underfitting problems of the Laplace approximation. The paper develops a new geometric view of reparameterizations that explains why linearization succeeds.

Abstract

Current approximate posteriors in Bayesian neural networks (BNNs) exhibit a crucial limitation: they fail to maintain invariance under reparameterization, i.e. BNNs assign different posterior densities to different parameterizations of identical functions. This creates a fundamental flaw in the application of Bayesian principles, as it breaks the correspondence between uncertainty over the parameters and uncertainty over the parameterized function. In this paper, we investigate this issue in the context of the increasingly popular linearized Laplace approximation. Specifically, it has been observed that linearized predictives alleviate the common underfitting problems of the Laplace approximation. We develop a new geometric view of reparameterizations from which we explain the success of linearization. Moreover, we demonstrate that these reparameterization invariance properties can be extended to the original neural network predictive using a Riemannian diffusion process, which gives a straightforward algorithm for approximate posterior sampling that empirically improves posterior fit.
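To make the invariance failure concrete, the following is a minimal sketch and is not taken from the paper. It assumes a hypothetical one-hidden-unit ReLU network and uses an isotropic Gaussian centred at the original weights as a stand-in for an approximate posterior. The names `network`, `rescale` and `gaussian_log_density` are illustrative only. Rescaling the two layers by `alpha` and `1 / alpha` leaves the function unchanged, yet the Gaussian assigns the two weight vectors different densities.

```python
import math

def relu(z):
    return max(0.0, z)

def network(w, x):
    """Hypothetical one-hidden-unit ReLU network: f(x) = w2 * relu(w1 * x)."""
    w1, w2 = w
    return w2 * relu(w1 * x)

def rescale(w, alpha):
    """Reparameterize with alpha > 0: (w1, w2) -> (alpha * w1, w2 / alpha)."""
    w1, w2 = w
    return (alpha * w1, w2 / alpha)

def gaussian_log_density(w, mean, variance):
    """Log density of an isotropic Gaussian (stand-in for an approximate posterior)."""
    d = len(w)
    sq_dist = sum((wi - mi) ** 2 for wi, mi in zip(w, mean))
    return -0.5 * sq_dist / variance - 0.5 * d * math.log(2.0 * math.pi * variance)

w = (1.0, 2.0)
w_rescaled = rescale(w, 4.0)  # different weights, same function
inputs = [-1.0, 0.0, 0.5, 3.0]

# Both weight vectors produce identical outputs on every test input.
same_function = all(
    math.isclose(network(w, x), network(w_rescaled, x)) for x in inputs
)

# The Gaussian in weight space nevertheless scores them differently.
log_p = gaussian_log_density(w, mean=w, variance=1.0)
log_p_rescaled = gaussian_log_density(w_rescaled, mean=w, variance=1.0)

print(same_function, log_p, log_p_rescaled)
```

The rescaling relies on the positive homogeneity of ReLU, so it only holds for `alpha > 0`. Other architectures have their own reparameterization symmetries.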

Paper Structure (55 sections, 12 theorems, 64 equations, 8 figures, 2 tables, 1 algorithm)

Key Result

Lemma 4.3

$\sim$ is an equivalence relation, i.e. it is transitive, symmetric and reflexive. We can therefore form the quotient space $\mathcal{P} = \mathbb{R}^D / \sim$ of effective parameters. We denote by $[\mathbf{w}]\in\mathcal{P}$ the equivalence class of an element $\mathbf{w}\in\mathbb{R}^D$.
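The definition of $\sim$ is not reproduced in this excerpt. As a hedged sketch, suppose it relates two weight vectors whenever they yield the same function, encoded here by a hypothetical map $\Phi$ (for example, the network outputs on a fixed set of inputs). Under that assumption the three properties follow directly from the properties of equality:

```latex
% Assumption: \Phi is a hypothetical map from weights to whatever defines
% "the same function", e.g. the vector of outputs on a fixed input set.
\[
  \mathbf{w} \sim \mathbf{w}' \iff \Phi(\mathbf{w}) = \Phi(\mathbf{w}')
\]
\begin{align*}
  \text{Reflexive:}  &\quad \Phi(\mathbf{w}) = \Phi(\mathbf{w})
                        \;\Rightarrow\; \mathbf{w} \sim \mathbf{w} \\
  \text{Symmetric:}  &\quad \Phi(\mathbf{w}) = \Phi(\mathbf{w}')
                        \;\Rightarrow\; \Phi(\mathbf{w}') = \Phi(\mathbf{w}) \\
  \text{Transitive:} &\quad \Phi(\mathbf{w}) = \Phi(\mathbf{w}'),\
                        \Phi(\mathbf{w}') = \Phi(\mathbf{w}'')
                        \;\Rightarrow\; \Phi(\mathbf{w}) = \Phi(\mathbf{w}'') \\
  \text{Class:}      &\quad [\mathbf{w}]
                        = \{\mathbf{w}' \in \mathbb{R}^D : \Phi(\mathbf{w}') = \Phi(\mathbf{w})\}
                        = \Phi^{-1}\bigl(\{\Phi(\mathbf{w})\}\bigr)
\end{align*}
```

Under this reading, each point of $\mathcal{P}$ is a level set of $\Phi$, i.e. the set of all weights that represent one function.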

Figures (8)

  • Figure 1: The function space is decomposed into directions of reparameterizations (kernel) and functional change (non-kernel). We improve the posterior fit by concentrating probability mass on directions of functional change.
  • Figure 2: The weight space can be decomposed into directions of reparameterizations and functional changes. For linear models (left) these are linear subspaces given by the kernel and the image, respectively. For nonlinear models, these are the nonlinear manifolds $\mathcal{P}_{\mathbf{w}_i}^{\perp}$ and $\mathcal{P}_{\mathbf{w}_i}$, respectively.
  • Figure 3: Underfitting of sampled Laplace is less pronounced when the rank of the GGN is higher for a fixed number of parameters. This is consistent with our hypothesis, as a high GGN rank implies a lower-dimensional kernel. For experimental details, see appendix \ref{sec: toy_results}.
  • Figure 4: Benchmark results for Rotated MNIST (similar results for FMNIST and CIFAR are in appendix \ref{sec:robustness}). Sampled Laplace significantly underfits even for non-rotated data. Laplace diffusion consistently outperforms the other methods.
  • Figure 5: Eigenvalues of the GGN of a convolutional neural network trained on MNIST.
  • ...and 3 more figures
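The linear case described in the captions of Figures 2 and 3 can be illustrated with a small sketch that is not taken from the paper. It assumes a linear model $f(\mathbf{w}) = X\mathbf{w}$ with squared loss, for which the GGN reduces to $X^\top X$ up to scaling. The data matrix `X` and the helper names are hypothetical. With one observation and two parameters the GGN has rank 1, so one direction in weight space is a pure reparameterization.

```python
def predict(X, w):
    """Linear model f(w) = X w, one output per data row."""
    return [sum(xij * wj for xij, wj in zip(row, w)) for row in X]

def ggn(X):
    """GGN of a linear model under squared loss (assumed here): X^T X."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)]
            for i in range(d)]

def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

def shift(w, direction, step):
    """Move the weights along a direction by the given step size."""
    return [wi + step * di for wi, di in zip(w, direction)]

X = [[1.0, 1.0]]           # one observation, two parameters
w = [0.5, 1.5]
kernel_dir = [1.0, -1.0]   # X @ kernel_dir = 0: reparameterization
change_dir = [1.0, 1.0]    # row space of X: functional change

G = ggn(X)

# Moving along the kernel leaves the predictions untouched.
w_kernel = shift(w, kernel_dir, 3.0)
# Moving along the row space changes the predictions.
w_change = shift(w, change_dir, 1.0)

print(predict(X, w), predict(X, w_kernel), predict(X, w_change))
print(matvec(G, kernel_dir), matvec(G, change_dir))
```

A posterior approximation that spreads mass along `kernel_dir` adds no uncertainty about the function in this linear example. This fits the Figure 3 caption's point that a higher GGN rank, and hence a smaller kernel, goes with less underfitting.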

Theorems & Definitions (21)

  • Definition 4.1
  • Definition 4.2
  • Lemma 4.3
  • Proposition 4.4
  • Theorem 4.5
  • Theorem 4.6
  • Theorem 5.1
  • Theorem B.1
  • Proposition B.2
  • proof
  • ...and 11 more