Layer-wise Linear Mode Connectivity

Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, Martin Jaggi

TL;DR

Layer-wise Linear Mode Connectivity (LLMC) analyzes how averaging neural network parameters layer by layer affects the loss surface. The authors define layer-wise interpolation and show, through empirical studies on CNNs and language-model-like architectures, that layer-wise averaging barriers are rare, even when full-network paths exhibit barriers. This is partly explained by a layer-wise convexity property demonstrated in a minimal deep linear network, where $L(\alpha)$ is convex along per-layer cuts. A robustness perspective reveals that averaging directions are particularly informative directions in parameter space, while perturbations along the null space or the training subspace have distinct effects across layers. In federated and personalization settings, partial layer averaging generally does not outperform full averaging except in extreme non-i.i.d. cases, suggesting that LLMC insights primarily guide understanding of loss landscapes rather than straightforward partial aggregation strategies. Overall, LLMC provides a finer-grained view of model fusion in non-convex regimes and informs federated learning design and future theoretical work on layer-wise optimization dynamics.

Abstract

Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models. It is most prominently used in federated learning. If models are averaged at the end of training, this can only lead to a good performing model if the loss surface of interest is very particular, i.e., the loss in the midpoint between the two models needs to be sufficiently low. This is impossible to guarantee for the non-convex losses of state-of-the-art networks. For averaging models trained on vastly different datasets, it was proposed to average only the parameters of particular layers or combinations of layers, resulting in better performing models. To get a better understanding of the effect of layer-wise averaging, we analyse the performance of the models that result from averaging single layers, or groups of layers. Based on our empirical and theoretical investigation, we introduce a novel notion of the layer-wise linear connectivity, and show that deep networks do not have layer-wise barriers between them.
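To make the idea of averaging only particular layers concrete, the following is a minimal sketch in NumPy. It assumes each model is stored as a dictionary mapping layer names to arrays; the function name `average_layers`, the layer names, and the convention that `alpha = 0.5` gives the plain average are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def average_layers(params_a, params_b, layers, alpha=0.5):
    """Return a copy of params_a in which only the named layers are replaced
    by (1 - alpha) * params_a[name] + alpha * params_b[name]."""
    merged = {name: value.copy() for name, value in params_a.items()}
    for name in layers:
        merged[name] = (1 - alpha) * params_a[name] + alpha * params_b[name]
    return merged

# Two toy "models" with hypothetical layer names.
model_a = {"conv1": np.ones((2, 2)), "fc": np.zeros(3)}
model_b = {"conv1": 3 * np.ones((2, 2)), "fc": np.ones(3)}

# Average only the final layer; every other layer stays at model_a's values.
merged = average_layers(model_a, model_b, layers=["fc"])
```

Passing every layer name recovers full-network averaging, and passing a growing list of names gives cumulative averaging over groups of layers.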

Paper Structure

This paper contains 21 sections, 1 theorem, 3 equations, 31 figures, and 2 tables.

Key Result

Theorem 4.1

Let $L(\alpha)$ be the squared loss of a deep linear network interpolated between two sets of parameters $\{\boldsymbol{W}^{(i)}\}_{i=1}^L$ and $\{\boldsymbol{W}'^{(i)}\}_{i=1}^L$ at any single layer $k \in \{1, \dots, L\}$ with interpolation coefficient $\alpha$. Then $L(\alpha)$ is convex and there are no barriers in layer-wise interpolation.
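The convexity claim can be checked numerically with a small sketch. The code below makes several assumptions that the statement above does not fix: the network has no biases, all layers other than $k$ are held at the first model's weights, $\alpha = 0$ corresponds to $\boldsymbol{W}^{(k)}$ and $\alpha = 1$ to $\boldsymbol{W}'^{(k)}$, and a barrier is measured as the largest excess of the loss curve over the chord between its endpoints. Weights and data are random placeholders, not trained models.

```python
import numpy as np

def layerwise_loss(Ws, Ws_prime, k, alpha, X, Y):
    """Squared loss of a deep linear network in which only layer k is
    interpolated; all other layers stay at the first model's weights."""
    layers = list(Ws)
    # Assumed convention: alpha = 0 keeps W^(k), alpha = 1 uses W'^(k).
    layers[k] = (1 - alpha) * Ws[k] + alpha * Ws_prime[k]
    out = X
    for W in layers:  # apply W^(1) first, W^(L) last
        out = W @ out
    return float(np.sum((out - Y) ** 2))

rng = np.random.default_rng(0)
dims = [4, 5, 5, 3]  # input width, two hidden widths, output width
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
Ws_prime = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
X = rng.standard_normal((dims[0], 20))
Y = rng.standard_normal((dims[-1], 20))

alphas = np.linspace(0.0, 1.0, 21)
min_curvature, max_barrier = [], []
for k in range(len(Ws)):
    losses = np.array([layerwise_loss(Ws, Ws_prime, k, a, X, Y) for a in alphas])
    # Discrete second differences are non-negative for a convex curve.
    min_curvature.append(np.min(losses[:-2] - 2 * losses[1:-1] + losses[2:]))
    # Barrier: largest excess of the curve over the chord between its endpoints.
    chord = (1 - alphas) * losses[0] + alphas * losses[-1]
    max_barrier.append(np.max(losses - chord))
```

With only one layer varying, the network output is affine in $\alpha$, so the squared loss is a quadratic in $\alpha$ with a non-negative leading coefficient. That is why the second differences stay non-negative and the curve never rises above its chord in this sketch.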

Figures (31)

  • Figure 1: CIFAR-10 with ResNet18. Heatmap shows layer-wise averaging barriers for layers on the Y-axis throughout training epochs on the X-axis. The first row shows the full-network averaging barrier.
  • Figure 2: CIFAR-10 with ResNet18. Full data training setup, from same initialization. Heatmap visualizes cumulative averaging, each layer added to the group of averaged layers one by one, starting from bottom or top.
  • Figure 3: Minimalistic example of the LLMC phenomenon with a 1D diagonal linear network: joint interpolation between $w$ and $w'$ leads to a barrier, while interpolating only the second layer leads to a much lower loss.
  • Figure 4: Layer-wise interpolations (left: model 1 $\rightarrow$ model 2 and model 2 $\rightarrow$ model 1) and robustness to random perturbations of the same norm (right: model 1 $\rightarrow$ model 2 and model 2 $\rightarrow$ model 1) for vision transformers trained on CIFAR-10 with different learning rates and data augmentations. The X-axis shows the interpolation points $\alpha$.
  • Figure 5: Test loss of two networks with perturbations of magnitude $\sigma$ in the training subspace, null space, and along their averaging direction; perturbing the full network and separate layers.
  • ...and 26 more figures

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 4.1: Layer-wise convexity
  • proof