On Local Posterior Structure in Deep Ensembles

Mikkel Jordahn, Jonas Vestergaard Jensen, Mikkel N. Schmidt, Michael Riis Andersen

TL;DR

The paper investigates whether adding local posterior structure to deep ensembles (DEs), yielding deep ensembles of BNNs (DE-BNNs), improves uncertainty quantification. Across multiple datasets and architectures, it finds that large DEs generally outperform DE-BNNs on in-distribution metrics, while DE-BNNs can offer out-of-distribution gains at an in-distribution cost. The study evaluates SWAG, Last-Layer Laplace Approximation (LLLA), and LA-NF as post-hoc local-posterior methods and performs extensive sensitivity analyses, concluding that DEs are often the pragmatically preferable choice for large ensembles. It also provides practical guidance on when DE-BNNs may be preferable and open-sources a large set of trained models for further research.

Abstract

Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimators such as maximum-a-posteriori (MAP). Deep ensembles (DEs) are also known to improve calibration, and therefore, it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this across a number of datasets, neural network architectures, and BNN approximation methods and surprisingly find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shed light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.
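To make the DE versus DE-BNN comparison concrete, the following is a minimal sketch of how the two predictive distributions could be formed for a single classification input. It assumes that a DE averages the softmax outputs of $K$ MAP-trained members, and that a DE-BNN additionally averages over $S$ Monte Carlo samples drawn from a local posterior approximation around each member. The function names, the toy logits, and the equal weighting of members are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_probs(prob_vectors):
    """Element-wise mean of a list of probability vectors."""
    n = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [sum(p[c] for p in prob_vectors) / n for c in range(n_classes)]


def de_predictive(member_logits):
    """DE: average the softmax outputs of K MAP-trained members."""
    return mean_probs([softmax(logits) for logits in member_logits])


def de_bnn_predictive(member_sample_logits):
    """DE-BNN (assumed form): average over S posterior samples per member,
    then average the K per-member predictives with equal weight."""
    per_member = [
        mean_probs([softmax(logits) for logits in samples])
        for samples in member_sample_logits
    ]
    return mean_probs(per_member)


# Toy data (hypothetical): K members, S samples per member, C classes.
rng = random.Random(0)
K, S, C = 3, 5, 4
map_logits = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(K)]

# Stand-in for sampling from a local posterior: perturb each member's logits.
sampled_logits = [
    [[z + rng.gauss(0.0, 0.5) for z in logits] for _ in range(S)]
    for logits in map_logits
]

p_de = de_predictive(map_logits)
p_de_bnn = de_bnn_predictive(sampled_logits)
```

In this sketch the Gaussian perturbation of the logits only stands in for a real local posterior method such as SWAG or a Laplace approximation, which would sample network weights rather than logits.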

Paper Structure

This paper contains 31 sections, 16 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Top: Test ELPD change in percentage of DEs and DE-BNNs relative to MAP estimates with $K=1$. DEs are not included here for $K=1$ as these correspond to the MAP estimates. Bottom: Test ELPD change in percentage of DE-BNNs relative to DEs. All points located below $y=0$ indicate that the DE of the same ensemble size outperforms the equivalently sized DE-BNN method.
  • Figure 2: Left: Effect of MC samples in predictive posterior across datasets and DE-BNN methods with $K=20$. y-axis is ELPD percentage change for a given method relative to the MAP estimate with $K=1$. Right: Effect of stratification of samples across ensemble members for DE-BNN methods across datasets with $K=20$. y-axis is the difference in ELPD percentage change between stratified or non-stratified samples for a given method.
  • Figure 3: Test ELPD vs. covariance scaling factor $\lambda$ for WRN-16-4 on CIFAR-10 (left) and CIFAR-100 (right). Dotted lines indicate DE performance for a given $K$.
  • Figure 4: Out-of-distribution performance versus in-distribution test performance for DEs and DE-BNNs. DEs are highlighted with black marker edges. Lines are drawn from each DE to the DE-BNN that performs the best on the AUROC metric for a given $K$ and dataset. Some points are faded for clarity.
  • Figure E.1: ELPD percentage change for all inference methods and datasets, excluding IVON on QM9. Top: relative to MAP ($K=1$). Bottom: relative to the DE with the same $K$.
  • ...and 3 more figures
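
Several of the captions above report the test ELPD as a percentage change relative to a baseline (MAP with $K=1$, or the DE with the same $K$). A minimal sketch of that quantity is given below. It assumes the ELPD is estimated as the mean log predictive probability of the true label over the test set, and that the change is normalized by the absolute value of the baseline so that a positive number means an improvement even though the ELPD is negative for classification. The exact normalization used in the paper is not stated in this summary, so the sign convention, the function names, and the toy numbers are all assumptions.

```python
import math


def elpd(true_class_probs):
    """Mean log predictive probability assigned to the true labels."""
    return sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


def pct_change(value, baseline):
    """Percentage change of `value` relative to `baseline`.

    Dividing by abs(baseline) is an assumed convention: it keeps
    'higher ELPD' equal to 'positive change' when both are negative.
    """
    return 100.0 * (value - baseline) / abs(baseline)


# Hypothetical predictive probabilities of the true class on four test points.
map_probs = [0.6, 0.5, 0.8, 0.4]  # single MAP model (K = 1)
de_probs = [0.7, 0.6, 0.8, 0.5]   # deep ensemble

change_vs_map = pct_change(elpd(de_probs), elpd(map_probs))
print(f"ELPD change of DE relative to MAP: {change_vs_map:.1f}%")
```

Under this convention, a point below $y=0$ in the bottom panel of Figure 1 corresponds to `pct_change(elpd_de_bnn, elpd_de) < 0`, where both arguments are ELPD values computed at the same ensemble size.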