How Good is a Single Basin?

Kai Lion, Lorenzo Noci, Thomas Hofmann, Gregor Bachmann

TL;DR

The paper questions whether ensembles drawn from a single loss basin can match the predictive performance and calibration of traditional deep ensembles that sample across multiple basins. It systematically builds connected ensembles within one basin using methods like SWE and constrained training, and shows that increased connectivity often reduces diversity unless cross-basin information is incorporated. By exploring permutation-based alignment and, more effectively, distillation from multi-basin ensembles, the study demonstrates that much of the information from other basins can be re-discovered inside a single basin, yielding competitive or near-parity performance with deep ensembles. The findings imply that the loss landscape contains substantial cross-basin knowledge and motivate distillation-based strategies to harness it without leaving a basin, with implications for efficiency and architecture-dependent behavior (e.g., ViTs).

Abstract

The multi-modal nature of neural loss landscapes is often considered to be the main driver behind the empirical success of deep ensembles. In this work, we probe this belief by constructing various "connected" ensembles which are restricted to lie in the same basin. Through our experiments, we demonstrate that increased connectivity indeed negatively impacts performance. However, when incorporating the knowledge from other basins implicitly through distillation, we show that the gap in performance can be mitigated by re-discovering (multi-basin) deep ensembles within a single basin. Thus, we conjecture that while the extra-basin knowledge is at least partially present in any given basin, it cannot be easily harnessed without learning it from other basins.
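The distillation step mentioned in the abstract can be illustrated with a minimal sketch. The excerpt does not state the paper's exact objective, so the code below assumes one common form: the student minimises the KL divergence from a teacher's predictive distribution, where the teacher is the average of the members' softmax outputs. The function names (`ensemble_probs`, `distillation_loss`) and the toy shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_probs(member_logits):
    """Average the members' predictive distributions.

    member_logits has shape (M, N, C): M members, N examples, C classes.
    Returns an (N, C) array of teacher probabilities.
    """
    return softmax(member_logits, axis=-1).mean(axis=0)

def distillation_loss(student_logits, teacher_probs, eps=1e-12):
    """Mean KL(teacher || student) over the N examples (an assumed objective)."""
    student_log_probs = np.log(softmax(student_logits) + eps)
    teacher_log_probs = np.log(teacher_probs + eps)
    kl = (teacher_probs * (teacher_log_probs - student_log_probs)).sum(axis=-1)
    return kl.mean()

# Toy data: 3 teacher members, 4 examples, 5 classes (random logits).
rng = np.random.default_rng(0)
member_logits = rng.normal(size=(3, 4, 5))
teacher = ensemble_probs(member_logits)
student_logits = rng.normal(size=(4, 5))
loss = distillation_loss(student_logits, teacher)
```

The loss is non-negative and vanishes only when the student reproduces the teacher's distribution, which is what "re-discovering" a teacher inside a single basin would require under this assumed objective.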

Paper Structure (48 sections, 3 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Illustration of toy deep ensemble $\{\bm{\theta}_1, \bm{\theta}_2, \bm{\theta}_3\}$ and the matching, connected ensemble $\{\tilde{\bm{\theta}}_1, \tilde{\bm{\theta}}_2, \tilde{\bm{\theta}}_3\}$.
  • Figure 2: Linear Mode Connectivity of ResNet20 and ViT ensembles. We approximate $q_{pair}$ through lines showing averages of five randomly selected pairs. The experiment is repeated with three random seeds, totalling 15 pairs. The shading shows the standard deviation.
  • Figure 3: Connectivity $\bar{q}$ plotted against test accuracy for ResNet20 on CIFAR100. The dashed horizontal line shows the accuracy of a deep ensemble, while the dotted horizontal line shows the mean member accuracy.
  • Figure 4: Each plot shows the 2D plane spanned by three weight vectors, namely the parameters of the ResNet20 models (trained on Tiny ImageNet) named in the legend, with the first model at the origin. The plane is constructed as in Garipov et al. (2018).
  • Figure 5: Accuracy, loss, and mean accuracy as a function of time parameter $t$ for ResNet20 on CIFAR100. The dashed vertical lines mark the $t$ used in Table \ref{tab:beta_table}.
  • ...and 3 more figures
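
The linear mode connectivity check behind Figure 2 can be sketched on a toy problem. The excerpt does not define $q_{pair}$ or $\bar{q}$, so the code below does not compute them; it instead uses a commonly used stand-in, the loss barrier along the straight line between two weight vectors. The double-well `toy_loss` and the helper names are hypothetical.

```python
import numpy as np

def toy_loss(theta):
    # Hypothetical double-well loss with minima at (-1, 0) and (1, 0).
    return (theta[0] ** 2 - 1.0) ** 2 + theta[1] ** 2

def interpolate(theta_a, theta_b, alpha):
    # Point on the straight line between two weight vectors.
    return (1.0 - alpha) * theta_a + alpha * theta_b

def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess of the path loss over the linearly interpolated endpoint losses."""
    alphas = np.linspace(0.0, 1.0, num_points)
    path = np.array([loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas])
    baseline = (1.0 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - baseline))

theta_a = np.array([-1.0, 0.0])  # minimum in the left well
theta_b = np.array([1.0, 0.0])   # minimum in the right well
barrier_across = loss_barrier(toy_loss, theta_a, theta_b)

theta_c = np.array([1.0, 0.1])   # a nearby point in the right well
barrier_within = loss_barrier(toy_loss, theta_b, theta_c)
```

In this toy setting the path between the two wells crosses a ridge and yields a large barrier, while the path within one well yields a barrier of zero, which is the sense in which two solutions are said to lie in the same basin.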