Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity

Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai

TL;DR

This work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.

Abstract

Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al., have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a $32\times$ width multiplier for ResNet-20). This is broadly believed to be because increasing the model width enlarges the space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when a suitable softmax temperature calibration is used. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently, the merged model's output matches that of an ensemble of the original models, which facilitates LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.
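
The merging procedure described in the abstract (interpolating parameters directly, with no permutation, and then calibrating the softmax temperature) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the dictionary representation of parameters, the convention that `lam` weights $\bm{\theta}_a$, and the grid search over the inverse temperature are all assumptions made for this example.

```python
import numpy as np

def interpolate(theta_a, theta_b, lam):
    """Merge two parameter dicts as lam * theta_a + (1 - lam) * theta_b.

    No permutation is applied. Which endpoint lam = 0 maps to is an
    assumed convention, not taken from the paper.
    """
    return {k: lam * theta_a[k] + (1.0 - lam) * theta_b[k] for k in theta_a}

def calibrated_cross_entropy(logits, labels, inv_temperature=1.0):
    """Mean cross-entropy of softmax(inv_temperature * logits)."""
    z = inv_temperature * np.asarray(logits, dtype=float)
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def best_inverse_temperature(logits, labels, grid=None):
    """Pick the inverse temperature on a grid that minimises the loss
    (hypothetical calibration routine; the grid is an arbitrary choice)."""
    if grid is None:
        grid = np.logspace(-1, 2, 50)
    losses = [calibrated_cross_entropy(logits, labels, b) for b in grid]
    return float(grid[int(np.argmin(losses))])
```

If the merged model's logits point in the right direction but have a small norm, the uncalibrated loss can be high even though accuracy is preserved, and rescaling the logits lowers it. In practice the inverse temperature would be fitted on held-out data rather than on the data used to report the loss.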

Paper Structure

This paper contains 41 sections, 5 theorems, 24 equations, 15 figures.

Key Result

Theorem 5.3

For two bias-free models $\bm{\theta}_a$ and $\bm{\theta}_b$, if weak additivity for ReLU activations (Definition 5.1) and reciprocal orthogonality (Definition 5.2) hold, then LEWC is satisfied.
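
To build intuition for this statement, the toy sketch below constructs two bias-free two-layer ReLU networks whose hidden units occupy disjoint coordinates, so that simplified analogues of the two conditions hold exactly. It is an illustrative assumption, not the paper's formal setting: the exact forms of Definitions 4.1, 5.1, and 5.2 are not reproduced in this summary, and the disjoint-support construction, the variable names, and the weights $\lambda^2$ and $(1-\lambda)^2$ at the second layer are specific to this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, half, d_out = 5, 4, 3  # hidden width is 2 * half

def relu(z):
    return np.maximum(z, 0.0)

def forward(w1, w2, x):
    """Bias-free two-layer ReLU network."""
    return w2 @ relu(w1 @ x)

# Model a uses only the first `half` hidden units, model b only the last `half`.
# This makes cross terms such as w2_a @ h_b vanish exactly (toy assumption).
w1_a = np.zeros((2 * half, d_in)); w1_a[:half] = rng.normal(size=(half, d_in))
w1_b = np.zeros((2 * half, d_in)); w1_b[half:] = rng.normal(size=(half, d_in))
w2_a = np.zeros((d_out, 2 * half)); w2_a[:, :half] = rng.normal(size=(d_out, half))
w2_b = np.zeros((d_out, 2 * half)); w2_b[:, half:] = rng.normal(size=(d_out, half))

x = rng.normal(size=d_in)
lam = 0.3

# Output of the merged (linearly interpolated) model.
merged = forward(lam * w1_a + (1 - lam) * w1_b,
                 lam * w2_a + (1 - lam) * w2_b, x)

# Exponentially weighted sum of the original outputs at layer 2.
weighted = (lam ** 2) * forward(w1_a, w2_a, x) + ((1 - lam) ** 2) * forward(w1_b, w2_b, x)

print(np.allclose(merged, weighted))
```

The absence of biases matters here because ReLU is positively homogeneous, so a hidden unit scaled by $\lambda > 0$ produces an activation scaled by the same factor. At $\lambda = 1/2$ the merged output in this toy case is a rescaled sum of the two models' outputs, which is consistent with the abstract's remark that the merged model behaves like an ensemble once the softmax temperature is calibrated.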

Figures (15)

  • Figure 1: Test accuracies of merged models without permutations for different values of the interpolation coefficient $\lambda$. Even in the absence of permutations, increasing the width multiplier enables the merged models to reach accuracy comparable to the original models (corresponding to $\lambda = 0$ and $1$).
  • Figure 2: Test losses of merged models without permutations. One panel shows the original (uncalibrated) loss values, while the other shows the values obtained after applying temperature scaling (inverse temperature).
  • Figure 3: Average cosine similarity between $f_\ell(\bm{x}; (\bm{\theta}_a + \bm{\theta}_b)/2)$ and $(f_\ell(\bm{x};\bm{\theta}_a) + f_\ell(\bm{x};\bm{\theta}_b))/2$ for each layer when test data are fed into the models. For the last layer, cosine similarity is computed between the logits. The color of each plot indicates the degree of width expansion. Wider models exhibit higher cosine similarity, making it easier for LEWC to hold.
  • Figure 4: Average cosine similarity between $\sigma((\tilde{\bm{z}}_\ell^{(a)} + \tilde{\bm{z}}_\ell^{(b)})/2)$ and $(\sigma(\tilde{\bm{z}}_\ell^{(a)}) + \sigma(\tilde{\bm{z}}_\ell^{(b)}))/2$, where $\tilde{\bm{z}}_\ell^{(a)}$ and $\tilde{\bm{z}}_\ell^{(b)}$ are the pre-activations of the $\ell$-th layer of two models. Different colors indicate different width expansion factors. The results indicate high cosine similarity for all layers.
  • Figure 5: Histogram of standard deviations of the ReLU inputs in the second hidden layer, relative to zero (i.e., $\sqrt{\mathbb{E}\tilde{\bm{z}}_{\ell,i}^2}$). Here, we present results for MLPs with width $\times 16$, a VGG-11 scaled $\times 16$, and a ResNet-20 scaled $\times 32$. Most dimensions are concentrated in the leftmost bin, indicating that only a few dimensions are active. Results for all layers are shown in the appendix.
  • ...and 10 more figures
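
The two quantities plotted in the figures above, the average cosine similarity (Figures 3 and 4) and the per-dimension standard deviation relative to zero, $\sqrt{\mathbb{E}\tilde{\bm{z}}_{\ell,i}^2}$ (Figure 5), can be computed along the following lines. This is a hedged sketch: the function names and the assumption that activations are stored as arrays of shape `(num_samples, dim)` (with convolutional feature maps flattened) are choices made for this example, not details taken from the paper.

```python
import numpy as np

def mean_cosine_similarity(u, v, eps=1e-12):
    """Average over samples of the cosine similarity between rows of u and v.

    u, v: arrays of shape (num_samples, dim), e.g. the layer output of the
    merged model and the average of the two original models' layer outputs.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    num = (u * v).sum(axis=1)
    den = np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1) + eps
    return float((num / den).mean())

def rms_per_dimension(z):
    """sqrt(E[z_i^2]) for every dimension i of the pre-activations z.

    z: array of shape (num_samples, dim). The result can be passed to a
    histogram routine to see how many dimensions are effectively inactive.
    """
    z = np.asarray(z, dtype=float)
    return np.sqrt((z ** 2).mean(axis=0))
```

Cosine similarity ignores the overall scale of the activations, so a value near 1 indicates that the merged layer output points in the same direction as the averaged outputs even when its norm differs.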

Theorems & Definitions (16)

  • Conjecture 1.1: Permutation invariance, informal
  • Definition 2.1: Loss Barrier (Entezari et al., 2022)
  • Definition 4.1: Layerwise Exponentially Weighted Connectivity
  • Definition 5.1: Weak Additivity for ReLU Activations (Zhou et al., 2023)
  • Definition 5.2: Reciprocal Orthogonality
  • Theorem 5.3
  • Theorem 5.4
  • Definition C.1: Commutativity
  • Theorem D.1
  • proof
  • ...and 6 more
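
As a companion to the loss barrier listed above (Definition 2.1), the sketch below evaluates a barrier from losses measured along the linear path. The text of the definition is not reproduced in this summary, so the form used here, the largest gap between the loss on the path and the linear interpolation of the endpoint losses, is an assumption based on the commonly used definition. The function name and the requirement that `lambdas` runs from 0 to 1 are likewise hypothetical.

```python
import numpy as np

def loss_barrier(path_losses, lambdas):
    """Largest excess of the path loss over the interpolated endpoint losses.

    path_losses[i] is the loss of the merged model at coefficient lambdas[i].
    Assumes lambdas[0] == 0 and lambdas[-1] == 1 (the two original models).
    """
    path_losses = np.asarray(path_losses, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    loss_at_0, loss_at_1 = path_losses[0], path_losses[-1]
    # Straight line between the two endpoint losses.
    baseline = (1.0 - lambdas) * loss_at_0 + lambdas * loss_at_1
    return float(np.max(path_losses - baseline))
```

Under this form, a barrier close to zero means the loss along the linear path never rises noticeably above the endpoints, which is the situation described as linear mode connectivity.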