Analyzing the Role of Permutation Invariance in Linear Mode Connectivity

Keyao Zhan, Puheng Li, Lei Wu

TL;DR

This work analyzes permutation invariance in linear mode connectivity (LMC) for two-layer ReLU networks under a teacher–student setup. It proves that, for sufficiently large student width $m$, the LMC barrier after applying the optimal permutation decays as $O(m^{-1/2})$, independent of the input dimension. It also reveals a double-descent pattern as $m$ increases relative to the teacher width $M$: the barrier drops to zero at $m=M$ and descends a second time once $m$ exceeds $2M$. The study further uncovers a learning-rate–driven sparsity transition in GD/SGD solutions and investigates how this sparsity affects permutation matching and the barrier, with empirical support on synthetic data, MNIST, and more complex architectures. These results illuminate how width, sparsity, and permutation interact to shape loss landscapes and have implications for model merging and ensemble methods.

Abstract

It was empirically observed in Entezari et al. (2021) that when accounting for the permutation invariance of neural networks, there is likely no loss barrier along the linear interpolation between two SGD solutions -- a phenomenon known as linear mode connectivity (LMC) modulo permutation. This phenomenon has attracted significant attention due to both its theoretical interest and its practical relevance in applications such as model merging. In this paper, we provide a fine-grained analysis of this phenomenon for two-layer ReLU networks under a teacher-student setup. We show that as the student network width $m$ increases, the LMC loss barrier modulo permutation exhibits a double descent behavior. In particular, when $m$ is sufficiently large, the barrier decreases to zero at a rate of $O(m^{-1/2})$. Notably, this rate does not suffer from the curse of dimensionality and demonstrates how substantially permutation can reduce the LMC loss barrier. Moreover, we observe a sharp transition in the sparsity of GD/SGD solutions when increasing the learning rate and investigate how this sparsity preference affects the LMC loss barrier modulo permutation. Experiments on both synthetic and MNIST datasets corroborate our theoretical predictions and reveal a similar trend for more complex network architectures.
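
To make "LMC loss barrier modulo permutation" concrete, the following is a minimal sketch for a two-layer ReLU network, not the paper's implementation. It rests on several assumptions that are not stated in this summary: the barrier is taken as the largest gap between the loss on the interpolation path and the linear interpolation of the endpoint losses, evaluated on a finite grid; the loss is mean squared error against a teacher; and the permutation is chosen by matching neuron weights with the Hungarian algorithm, which need not coincide with the optimal permutation analyzed in the paper. All function names (`forward`, `loss`, `align`, `barrier`) are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def forward(W, a, X):
    """Two-layer ReLU network f(x) = sum_k a_k * relu(w_k . x)."""
    return np.maximum(X @ W.T, 0.0) @ a


def loss(W, a, X, y):
    """Mean squared error (assumed loss; the paper's loss may differ)."""
    return 0.5 * np.mean((forward(W, a, X) - y) ** 2)


def align(W1, a1, W2, a2):
    """Reorder the neurons of network 2 to best match network 1.

    Assumption: neurons are matched by squared distance between their
    (input weights, output weight) vectors, solved as an assignment problem.
    """
    P1 = np.hstack([W1, a1[:, None]])
    P2 = np.hstack([W2, a2[:, None]])
    # cost[i, j] = squared distance between neuron i of net 1 and neuron j of net 2
    cost = ((P1[:, None, :] - P2[None, :, :]) ** 2).sum(axis=-1)
    _, perm = linear_sum_assignment(cost)
    return W2[perm], a2[perm]


def barrier(W1, a1, W2, a2, X, y, n_alpha=21):
    """Largest excess loss on the linear path between the two networks."""
    l1, l2 = loss(W1, a1, X, y), loss(W2, a2, X, y)
    gaps = []
    for al in np.linspace(0.0, 1.0, n_alpha):
        W = al * W1 + (1.0 - al) * W2
        a = al * a1 + (1.0 - al) * a2
        gaps.append(loss(W, a, X, y) - (al * l1 + (1.0 - al) * l2))
    return max(gaps)


# Toy teacher-student data; sizes chosen only for illustration.
rng = np.random.default_rng(0)
d, M, m, n = 8, 6, 12, 500
X = rng.standard_normal((n, d))
y = forward(rng.standard_normal((M, d)), np.ones(M), X)

# Two random (untrained) students stand in for two solutions.
W1, a1 = rng.standard_normal((m, d)), rng.standard_normal(m)
W2, a2 = rng.standard_normal((m, d)), rng.standard_normal(m)

direct = barrier(W1, a1, W2, a2, X, y)
W2p, a2p = align(W1, a1, W2, a2)
permuted = barrier(W1, a1, W2p, a2p, X, y)
print(f"direct barrier: {direct:.4f}, permuted barrier: {permuted:.4f}")
```

Reordering hidden neurons leaves the network function unchanged, so `align` changes only the interpolation path, not the endpoint losses. The random students here are placeholders; reproducing the trends described above would require actual GD/SGD solutions and the paper's own matching procedure.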

Paper Structure

This paper contains 17 sections, 4 theorems, 28 equations, 25 figures, 1 algorithm.

Key Result

Lemma 2

Suppose that $m \geqslant M$. Let $S_0=\left\{(0, \ldots, 0) \in \mathbb{R}^d\right\}, S_j=\left\{\alpha e_j: \alpha > 0\right\}$ for $j \in[M]$, and $S=\cup_{j=0}^M S_j$. Then $\mathcal{M}$ is compact and can be analytically characterized as follows

Figures (25)

  • Figure 1: $M=6$
  • Figure 2: $M=20$
  • Figure 4: The log barrier curve for SGD solutions and uniformly sampled solutions. The number of teacher neurons is $M = 6$, the input dimension is $d = 8$, and the number of student neurons $m$ is varied from 7 to 36. Each data point is an average of 20 independent realizations.
  • Figure 5: The normalized log barrier curve for uniformly sampled solutions. The barrier for direct linear interpolation in each setting with different $M$ is normalized to 1, and we plot the relative barrier for permuted solutions with different numbers of teacher neurons $M = 4,20,100,500$. $x$-axis is $m/M$ and $y$-axis represents normalized barrier = $\text{Barrier}_{\text{Permuted}}/\text{Barrier}_{\text{Direct}}$. Each data point is an average of 50 independent realizations.
  • Figure 6: The double descent phenomenon for LMC modulo permutation. Barrier as a function of the number of student neurons $m$. The first descent appears as $m$ approaches $M$ (under-realization regime), and the second descent occurs as $m$ exceeds $2M$. Note that when $m=M$, the student neurons can always be matched to the teacher neurons, so the barrier is 0.
  • ...and 20 more figures

Theorems & Definitions (6)

  • Lemma 2
  • Definition 4
  • Theorem 5
  • Theorem 6
  • Definition 7
  • Lemma 8