Linear Mode Connectivity under Data Shifts for Deep Ensembles of Image Classifiers
C. Hepburn, T. Zielke, A. P. Raulf
TL;DR
The paper investigates how data shifts shape linear mode connectivity (LMC) in deep ensembles of image classifiers. The training data are partitioned to induce covariate shift, label imbalance, and domain shift, and MLPs, VGG, and ResNet are compared under varying batch sizes and learning rates. The study links SGD noise to whether models converge to the same or to different loss basins, using loss barriers, interpolation curves, and similarity metrics as probes. A key finding is that increasing the batch size and/or decreasing the learning rate reduces the SGD noise arising from data shifts, which promotes LMC and higher model similarity, although deep architectures can still exhibit barriers and degraded generalization under challenging shifts; batch normalization (BN) can recover LMC where data normalization would otherwise break it. The results highlight a practical trade-off for deep ensembles: sampling multiple models via LMC can improve ensemble efficiency, whereas independent basins offer greater functional diversity and potential accuracy gains at the cost of additional training resources. These insights guide training choices in realistic, shift-prone deployment settings and underscore the importance of architecture, normalization, and optimization regime for ensemble behavior. The SGD noise scale $g = \varepsilon \left(\frac{N}{B}-1\right)$, with learning rate $\varepsilon$, training-set size $N$, and batch size $B$, captures the intuition behind these observations and connects data shifts to optimization dynamics.
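To make the noise-scale expression at the end of the TL;DR concrete, the following minimal sketch evaluates $g = \varepsilon \left(\frac{N}{B}-1\right)$ for a few settings. It assumes the usual reading of the symbols ($\varepsilon$ as learning rate, $N$ as training-set size, $B$ as batch size); the function name `sgd_noise_scale` and the numeric values are hypothetical and are not taken from the paper's experiments.

```python
def sgd_noise_scale(learning_rate: float, dataset_size: int, batch_size: int) -> float:
    """Evaluate g = eps * (N / B - 1).

    Assumed symbol meanings: eps = learning rate, N = training-set size,
    B = batch size.
    """
    if not 0 < batch_size <= dataset_size:
        raise ValueError("batch_size must lie in [1, dataset_size]")
    return learning_rate * (dataset_size / batch_size - 1.0)


# Illustrative values only (not the paper's configuration).
N = 50_000
for lr, bs in [(0.1, 128), (0.1, 512), (0.01, 128)]:
    g = sgd_noise_scale(lr, N, bs)
    print(f"lr={lr}, batch_size={bs}: g={g:.3f}")
```

Under this expression, a larger batch size or a smaller learning rate lowers $g$, which matches the direction of the effect described in the TL;DR.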
Abstract
The phenomenon of linear mode connectivity (LMC) links several aspects of deep learning, including training stability under noisy stochastic gradients, the smoothness and generalization of local minima (basins), the similarity and functional diversity of sampled models, and architectural effects on data processing. In this work, we experimentally study LMC under data shifts and identify conditions that mitigate their impact. We interpret data shifts as an additional source of stochastic gradient noise, which can be reduced through small learning rates and large batch sizes. These parameters influence whether models converge to the same local minimum or to regions of the loss landscape with varying smoothness and generalization. Although models sampled via LMC tend to make similar errors more frequently than those converging to different basins, the benefit of LMC lies in balancing training efficiency against the gains achieved from larger, more diverse ensembles. Code and supplementary materials will be made publicly available at https://github.com/DLR-KI/LMC in due course.
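As an illustration of how LMC between two models can be probed, the sketch below computes a loss barrier along the straight line between two parameter vectors. It is a toy example under stated assumptions: the barrier is taken as the largest excess of the loss on the path over the linear interpolation of the endpoint losses, which is one common definition and not necessarily the one used in this work. The names `interpolate`, `loss_barrier`, and `double_well` are hypothetical, and a two-parameter double-well function stands in for a network's loss.

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], alpha: float) -> List[float]:
    """Return (1 - alpha) * theta_a + alpha * theta_b, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(
    loss_fn: Callable[[Sequence[float]], float],
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    num_points: int = 11,
) -> float:
    """Largest excess of the path loss over the interpolated endpoint losses.

    The result is clamped at zero, so a path that never rises above the
    endpoint baseline yields a barrier of 0.0.
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(num_points):
        alpha = i / (num_points - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def double_well(theta: Sequence[float]) -> float:
    """Toy loss with two minima at (-1, 0) and (1, 0); stands in for a network loss."""
    return (theta[0] ** 2 - 1.0) ** 2 + theta[1] ** 2


# Endpoints in different basins: the straight path crosses the ridge at theta[0] = 0.
different_basins = loss_barrier(double_well, [-1.0, 0.0], [1.0, 0.0])

# Endpoints in the same basin: the path stays below the endpoint baseline.
same_basin = loss_barrier(double_well, [1.0, 0.0], [1.0, 0.2])

print(different_basins, same_basin)
```

In this toy setting, a barrier near zero corresponds to two solutions that are linearly mode connected, while a clearly positive barrier indicates convergence to different basins. For real networks, `loss_fn` would evaluate the interpolated weights on held-out data.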
