Linear Mode Connectivity under Data Shifts for Deep Ensembles of Image Classifiers

C. Hepburn, T. Zielke, A. P. Raulf

TL;DR

The paper investigates how data shifts shape linear mode connectivity (LMC) in deep ensembles across image-classification models. By partitioning training data to induce covariate shift, label imbalance, and domain shift, and by comparing MLPs, VGG, and ResNet under varying batch sizes and learning rates, the study links SGD noise to convergence to the same or different loss basins, using loss barriers, interpolation curves, and similarity metrics as probes. A key finding is that increasing batch size and/or decreasing learning rate reduces SGD noise from data shifts, promoting LMC and higher model similarity, though deep architectures can still exhibit barriers and degraded generalization under challenging shifts; batch normalization (BN) can recover LMC where data normalization would otherwise break it. The results highlight a practical trade-off for deep ensembles: sampling multiple models via LMC can improve ensemble efficiency, but independent basins offer greater functional diversity and potential accuracy gains at the cost of training resources. These insights guide training choices in realistic, shift-prone deployment settings and underscore the importance of architecture, normalization, and optimization regime for ensemble behavior. The SGD noise scale $g = \varepsilon \left(\frac{N}{B}-1\right)$, with learning rate $\varepsilon$, training-set size $N$, and batch size $B$, encapsulates the noise intuition underpinning these observations, connecting data shifts to optimization dynamics.

Abstract

The phenomenon of linear mode connectivity (LMC) links several aspects of deep learning, including training stability under noisy stochastic gradients, the smoothness and generalization of local minima (basins), the similarity and functional diversity of sampled models, and architectural effects on data processing. In this work, we experimentally study LMC under data shifts and identify conditions that mitigate their impact. We interpret data shifts as an additional source of stochastic gradient noise, which can be reduced through small learning rates and large batch sizes. These parameters influence whether models converge to the same local minimum or to regions of the loss landscape with varying smoothness and generalization. Although models sampled via LMC tend to make similar errors more frequently than those converging to different basins, the benefit of LMC lies in balancing training efficiency against the gains achieved from larger, more diverse ensembles. Code and supplementary materials will be made publicly available at https://github.com/DLR-KI/LMC in due course.
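The noise-scale expression quoted in the TL;DR, $g = \varepsilon \left(\frac{N}{B}-1\right)$, can be evaluated directly to see how batch size and learning rate trade off. The sketch below is illustrative only: the function name `sgd_noise_scale` is hypothetical, and the subset size of 25,000 samples is an assumption (an equal split of a 50,000-image training set into two partitions), not a value stated in the text. The batch size 32 and the learning rates $10^{-3}$ and $10^{-4}$ appear in the figure captions; the batch size 512 is an arbitrary larger value chosen for comparison.

```python
def sgd_noise_scale(learning_rate: float, dataset_size: int, batch_size: int) -> float:
    """Return g = eps * (N / B - 1) for learning rate eps, dataset size N, batch size B."""
    if not 0 < batch_size <= dataset_size:
        raise ValueError("batch_size must be in the range 1..dataset_size")
    return learning_rate * (dataset_size / batch_size - 1.0)


# Assumed subset size: half of a 50,000-image training set (hypothetical).
N_SUBSET = 25_000

# (learning rate, batch size) pairs; 512 is an arbitrary "large batch" value.
settings = [(1e-3, 32), (1e-4, 32), (1e-3, 512), (1e-3, N_SUBSET)]

for lr, bs in settings:
    g = sgd_noise_scale(lr, N_SUBSET, bs)
    print(f"lr={lr:g}  batch={bs:>6}  g={g:.5f}")
```

Under this expression, a tenfold smaller learning rate lowers $g$ tenfold, a larger batch lowers it roughly in proportion to $B$, and full-batch training ($B = N$) gives $g = 0$. The expression does not model the extra noise that data shifts contribute; it only shows the direction in which the two hyperparameters act.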

Paper Structure

This paper contains 20 sections, 3 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: The stability of mini-batch training depends on the SGD noise scale $g$ [smith]. Mini-batch gradients are noisy estimates of the true loss gradient; this noise arises from data shifts, as well as from random data sampling and/or augmentation [frankle]. In large-batch or small-learning-rate regimes (alt), this noise vanishes; these parameters influence convergence towards a local minimum or towards regions of the loss landscape with varying smoothness and generalization properties [visualizing]. Models sampled via LMC tend to make the same mistakes more often. Notation: w/o BN stands for 'without batch normalization'.
  • Figure 2: Illustration of the training scheme. Training data is partitioned into two disjoint subsets (\ref{partitions}). When the SGD noise sample is fixed (as shown in the figure), image semantics are the same for the same batch and epoch across the training subsets, but vary across epochs due to data shuffling (pseudo-code is presented in the Appendix, \ref{pseudocode}).
  • Figure 3: Interpolated models may yield lower loss than the models they are interpolated between. When appropriate, the barrier is measured with respect to local minima in $\lambda$ (shown with dashed lines in A). Top: average loss versus interpolation parameter $\lambda$ shows different "basin structures". Bottom: corresponding accuracy (test data). The vertical orange line indicates the value of $\lambda$ at which the barrier $\mathcal{B}$ and the accuracy difference $\Delta$ were measured.
  • Figure 4: Non-zero barriers and a significant drop in accuracy of deep models suggest that covariate shift alone alters SGD training trajectories. One-hidden-layer MLPs have stable training dynamics. Barrier $\mathcal{B}$ and accuracy difference $\Delta$ on linear interpolation for models of variable depth in the small-batch training regime (batch size 32) under a fixed SGD noise sample. Learning rates: $10^{-3}$ for MLPs with one (L1) and three (L3) hidden layers and for ResNet, $10^{-4}$ for VGG. Values are from five experimental runs (different seeds). The horizontal dashed line indicates an accuracy difference of $-5\%$.
  • Figure 5: Overfitting can explain poor generalization of interpolated models despite zero barriers. (A)-(C): average learning curves and interpolation curves for three-hidden-layer MLPs trained without batch normalization on CIFAR-10 subsets with equal label splits under a fixed SGD noise sample (five experimental runs, batch size 32, learning rate $10^{-3}$). Interpolated models are evaluated on test data. (D)-(F): same for MLPs trained with batch normalization. Notation: MLP L3, three-hidden-layer MLP; BN, batch normalization.
  • ...and 9 more figures
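
The captions of Figures 3 and 4 refer to a barrier $\mathcal{B}$ measured along the linear interpolation between two trained models. As a minimal sketch, the code below uses one common definition, the largest excess of the loss at $w(\lambda) = (1-\lambda)\,w_A + \lambda\,w_B$ over the straight line joining the two endpoint losses; the paper's exact definition (for instance, measuring relative to local minima in $\lambda$) may differ. The function names and the two toy loss functions are hypothetical stand-ins for real networks and data.

```python
from typing import Callable, List, Sequence, Tuple


def interpolate(w_a: Sequence[float], w_b: Sequence[float], lam: float) -> List[float]:
    """Linear interpolation w(lam) = (1 - lam) * w_a + lam * w_b."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(w_a, w_b)]


def loss_barrier(
    loss_fn: Callable[[Sequence[float]], float],
    w_a: Sequence[float],
    w_b: Sequence[float],
    num_points: int = 21,
) -> Tuple[float, float, List[float]]:
    """Return (barrier, lambda at the barrier, losses along the path).

    The barrier is the maximum over lambda of the loss minus the linear
    interpolation of the two endpoint losses (an assumed definition).
    """
    lams = [i / (num_points - 1) for i in range(num_points)]
    losses = [loss_fn(interpolate(w_a, w_b, lam)) for lam in lams]
    excess = [
        loss - ((1.0 - lam) * losses[0] + lam * losses[-1])
        for lam, loss in zip(lams, losses)
    ]
    best = max(range(num_points), key=excess.__getitem__)
    return excess[best], lams[best], losses


def double_well(w: Sequence[float]) -> float:
    """Toy loss with two separate minima at w = (-1, 0) and w = (1, 0)."""
    return (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2


def bowl(w: Sequence[float]) -> float:
    """Toy convex loss with a single minimum at the origin."""
    return w[0] ** 2 + w[1] ** 2


W_A, W_B = [-1.0, 0.0], [1.0, 0.0]

barrier_dw, lam_dw, _ = loss_barrier(double_well, W_A, W_B)
barrier_bowl, lam_bowl, path_bowl = loss_barrier(bowl, W_A, W_B)

print("double well:", barrier_dw, "at lambda =", lam_dw)
print("bowl:", barrier_bowl, "min loss on path =", min(path_bowl))
```

In the double-well case the two endpoints sit in different minima, so the path crosses a ridge and the barrier is positive. In the bowl case the barrier is zero and the midpoint has lower loss than either endpoint, which mirrors the observation in the Figure 3 caption that interpolated models may yield lower loss than the models they are interpolated between.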