Generalized Linear Mode Connectivity for Transformers

Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, Valentina Boeva

TL;DR

Parameter-space symmetries make the loss landscapes of deep models look fragmented: functionally equivalent models can sit far apart in weight space. This work introduces a generalized symmetry framework covering permutations, semi-permutations, orthogonal transformations, and general invertible maps, and uses it to align Transformer-based models and uncover linear mode connectivity (LMC). The authors develop weight matching, learned matching, and multi-model merging techniques (including universe matching and learned refinement) that yield low- or zero-barrier linear interpolation paths between independently trained Vision Transformers and GPT-2 models, even when the models differ in width. The findings indicate that symmetries beyond permutations are needed to connect the minima of modern models, with practical implications for ensembling, federated learning, continual learning, and robustness. The paper also notes limitations in language-model alignment and points to soft, differentiable relaxations as an open direction. Overall, it shows that Transformer minima lie in connected basins once these richer symmetries are accounted for, giving a principled route to interpolation across models of different widths.

Abstract

Understanding the geometry of neural network loss landscapes is a central question in deep learning, with implications for generalization and optimization. A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths despite appearing to lie in separate loss basins. However, this is often obscured by symmetries in parameter space -- such as neuron permutations -- which make functionally equivalent models appear dissimilar. Prior work has predominantly focused on neuron reordering through permutations, but such approaches are limited in scope and fail to capture the richer symmetries exhibited by modern architectures such as Transformers. In this work, we introduce a unified framework that captures four symmetry classes -- permutations, semi-permutations, orthogonal transformations, and general invertible maps -- broadening the set of valid reparameterizations and subsuming many previous approaches as special cases. Crucially, this generalization enables, for the first time, the discovery of low- and zero-barrier linear interpolation paths between independently trained Vision Transformers and GPT-2 models. Furthermore, our framework extends beyond pairwise alignment to multi-model and width-heterogeneous settings, enabling alignment across architectures of different sizes. These results reveal deeper structure in the loss landscape and underscore the importance of symmetry-aware analysis for understanding model space geometry.
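To make the notion of a function-preserving reparameterization concrete, here is a minimal NumPy sketch of two of the symmetry classes named above. It is an illustration under stated assumptions, not the authors' implementation: the toy two-layer MLP, the attention-logit example, and all variable names are hypothetical, and this summary does not say which Transformer component each symmetry class is applied to.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2):
    """Toy two-layer ReLU MLP (illustrative, not the paper's model)."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2

# --- Permutation symmetry: reorder hidden units consistently in both layers.
d_in, d_hidden, d_out = 4, 8, 3
x = rng.normal(size=(5, d_in))
W1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_hidden, d_out))

perm = rng.permutation(d_hidden)
out_ref = mlp(x, W1, b1, W2)
out_perm = mlp(x, W1[:, perm], b1[perm], W2[perm, :])
perm_gap = np.max(np.abs(out_ref - out_perm))

# --- Invertible-map symmetry (assumed example): attention logits q k^T are
# unchanged under W_Q -> W_Q A and W_K -> W_K A^{-T} for any invertible A.
d_model, d_head = 6, 4
h = rng.normal(size=(5, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

# Build a well-conditioned, non-orthogonal invertible matrix A = Q diag(s).
Q, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))
A = Q @ np.diag([0.5, 1.0, 2.0, 3.0])

logits_ref = (h @ W_Q) @ (h @ W_K).T
logits_new = (h @ W_Q @ A) @ (h @ W_K @ np.linalg.inv(A).T).T
inv_gap = np.max(np.abs(logits_ref - logits_new))

print(f"permutation gap: {perm_gap:.2e}, invertible-map gap: {inv_gap:.2e}")
```

Both gaps should be at the level of floating-point round-off. The weights change, but the computed function does not, which is why two such models can look dissimilar in parameter space while being equivalent.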

Paper Structure

This paper contains 52 sections, 33 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: By considering network symmetries beyond permutations, we can teleport two independently trained Transformers to the same loss basin. $\Theta_B$ is projected into a functionally equivalent representation $\pi(\Theta_B)$.
  • Figure 2: Transformer layer.
  • Figure 3: Transformer layer after projection.
  • Figure 5: Loss along the interpolation path for (a, b) width-homogeneous and (c) width-heterogeneous alignment. In (c) $\Theta_{\downarrow}$ denotes the smaller model (reduced embedding dimension; the reduction ratio is indicated in the label), which is aligned to the larger one ($\Theta_{\uparrow}$, with embedding dimension 512). Learned matching was used for the width-heterogeneous results.
  • Figure 6: Linear mode connectivity surface between three CIFAR-10 models for different alignment strategies. Colors show the relative loss deviation from the linear interpolation baseline across the simplex spanned by $\pi(\Theta_A)$, $\pi(\Theta_B)$, and $\pi(\Theta_C)$. The dashed contour ($\varepsilon = 0$) marks points where the loss equals the linear baseline.
  • ...and 5 more figures
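
The interpolation-path and simplex captions above compare the loss along a linear path against the linear interpolation baseline between the endpoint losses. The sketch below computes that deviation for a toy case, assuming the barrier is defined as the maximum of loss((1 - λ)Θ_A + λΘ_B) minus (1 - λ)·loss(Θ_A) + λ·loss(Θ_B) over λ in [0, 1]. The two-layer MLP, the random regression data, and the helper names are hypothetical; here Θ_B is simply a permuted copy of Θ_A, so undoing the permutation plays the role of the alignment π.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, params):
    """Toy two-layer ReLU MLP (illustrative only)."""
    W1, W2 = params
    return np.maximum(x @ W1, 0.0) @ W2

def mse(params, x, y):
    return float(np.mean((mlp(x, params) - y) ** 2))

def loss_barrier(theta_a, theta_b, x, y, num_points=11):
    """Max deviation of the path loss from the linear baseline."""
    loss_a, loss_b = mse(theta_a, x, y), mse(theta_b, x, y)
    deviations = []
    for lam in np.linspace(0.0, 1.0, num_points):
        theta = [(1 - lam) * a + lam * b for a, b in zip(theta_a, theta_b)]
        baseline = (1 - lam) * loss_a + lam * loss_b
        deviations.append(mse(theta, x, y) - baseline)
    return max(deviations)

d_in, d_hidden, d_out = 4, 16, 2
x = rng.normal(size=(64, d_in))
y = rng.normal(size=(64, d_out))

theta_a = [rng.normal(size=(d_in, d_hidden)), rng.normal(size=(d_hidden, d_out))]

# Theta_B: the same function as Theta_A, with hidden units shuffled.
perm = rng.permutation(d_hidden)
theta_b = [theta_a[0][:, perm], theta_a[1][perm, :]]

# Alignment (assumed stand-in for pi): undo the permutation before interpolating.
inv = np.argsort(perm)
theta_b_aligned = [theta_b[0][:, inv], theta_b[1][inv, :]]

barrier_naive = loss_barrier(theta_a, theta_b, x, y)
barrier_aligned = loss_barrier(theta_a, theta_b_aligned, x, y)
print(f"naive barrier: {barrier_naive:.4f}, aligned barrier: {barrier_aligned:.2e}")
```

In this toy setting the aligned barrier vanishes up to round-off because the aligned endpoints coincide, while the naive barrier is typically positive even though both endpoints compute the same function. Independently trained models are not exact copies of each other, so the paper's matching procedures have a harder problem than this sketch suggests.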