On Linear Mode Connectivity of Mixture-of-Experts Architectures

Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen

TL;DR

The paper investigates Linear Mode Connectivity (LMC) in Mixture-of-Experts (MoE) architectures, arguing that permutation symmetries of gating and experts fully account for low-barrier linear connections between independently trained models. It introduces a group-action framework on MoE weight spaces and proves functional equivalence results for both dense and sparse gating, under well-motivated assumptions. A practical two-stage permutation alignment and a weight-matching algorithm are proposed to reveal LMC without data dependence, and the method is validated across dense MoE, SMoE, and DeepSeekMoE variants on vision and language tasks. The findings illuminate the loss landscape of MoEs, showing that aligned permutations place independently trained models in connected optima, with implications for ensemble methods and understanding optimization dynamics in scalable architectures.

Abstract

Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been observed to be connected--up to permutation symmetries--by linear paths in parameter space along which the loss remains consistently low. This observation challenges classical views of non-convex optimization and has implications for model ensembling, generalization, and our understanding of neural loss geometry. Inspired by recent studies on LMC in standard neural networks, we systematically investigate this phenomenon within Mixture-of-Experts (MoE) architectures--a class of models known for their scalability and computational efficiency, which combine traditional neural networks--referred to as experts--through a learnable gating mechanism. We begin by conducting a comprehensive analysis of both dense and sparse gating regimes, demonstrating that the symmetries inherent to MoE architectures are fully characterized by permutations acting on both the expert components and the gating function. Building on these foundational findings, we propose a matching algorithm that enables alignment between independently trained MoEs, thereby facilitating the discovery of LMC. Finally, we empirically validate the presence of LMC using our proposed algorithm across diverse MoE configurations--including dense, sparse, and shared-expert variants--under a wide range of model settings and datasets of varying scales and modalities. Our results confirm the existence of LMC in MoE architectures and offer fundamental insights into the functional landscape and optimization dynamics of deep learning models.
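As a rough illustration of the "linear paths in parameter space along which the loss remains consistently low" described above, the following sketch measures a loss barrier along the straight line between two parameter vectors. The barrier definition (largest excess of the path loss over the linearly interpolated endpoint losses), the function names, and the toy loss are assumptions made for illustration; the paper's exact metric is not stated in this section.

```python
def interpolate(theta_a, theta_b, t):
    """Return (1 - t) * theta_a + t * theta_b for flat parameter lists."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess of the path loss over the interpolated endpoint losses.

    Assumed definition (one common convention, not confirmed by the source):
    max over t of  L((1-t)a + t b) - ((1-t) L(a) + t L(b)).
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(num_points):
        t = i / (num_points - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, t))
        baseline = (1.0 - t) * loss_a + t * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def toy_loss(w):
    """Hypothetical toy loss with two minima, (1, -1) and (-1, 1),
    that are mapped onto each other by swapping the two coordinates."""
    return ((w[0] - 1) ** 2 + (w[1] + 1) ** 2) * ((w[0] + 1) ** 2 + (w[1] - 1) ** 2)


theta_a = [1.0, -1.0]
theta_b = [-1.0, 1.0]

# Without alignment the straight path crosses a high-loss region.
naive_barrier = loss_barrier(toy_loss, theta_a, theta_b)

# Swapping the coordinates of theta_b (a permutation) lands on theta_a,
# so the aligned path has no barrier in this toy setting.
theta_b_aligned = [theta_b[1], theta_b[0]]
aligned_barrier = loss_barrier(toy_loss, theta_a, theta_b_aligned)
```

In this toy case the two minima are exact permuted copies of each other, which is far simpler than two independently trained networks; it only shows why alignment is applied before interpolating.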

Paper Structure

This paper contains 55 sections, 14 theorems, 129 equations, 96 figures, 13 tables, 1 algorithm.

Key Result

Proposition 3.1

The MoE function $\mathcal{D}$ is $G(n)$-invariant under the action of $G(n)$ on its weight space $\Phi(n)$, i.e., $\mathcal{D}(\cdot ; \phi) = \mathcal{D}(\cdot ; g\phi)$.
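The invariance in Proposition 3.1 can be checked numerically in a simplified setting. The sketch below assumes a dense softmax-gated MoE with linear experts and scalar output, and applies only an expert permutation (relabeling gate rows and experts together). The definitions of $G(n)$ and $\Phi(n)$ are not given in this section, so the group element here is an illustrative stand-in, and all names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(w, b, x):
    """Affine map w . x + b with scalar output."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b


def dense_moe(x, gate, experts):
    """Dense MoE: sum_i softmax(gate logits)_i * expert_i(x).

    gate and experts are lists of (weight vector, bias) pairs, one per expert.
    Linear experts are a simplification made for this sketch.
    """
    probs = softmax([linear(w, b, x) for w, b in gate])
    outputs = [linear(w, b, x) for w, b in experts]
    return sum(p * o for p, o in zip(probs, outputs))


def permute(params, perm):
    """Reorder per-expert parameters: new slot i holds old slot perm[i]."""
    return [params[i] for i in perm]


rng = random.Random(0)
dim, n_experts = 3, 4


def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(dim)]


gate = [(rand_vec(), rng.uniform(-1, 1)) for _ in range(n_experts)]
experts = [(rand_vec(), rng.uniform(-1, 1)) for _ in range(n_experts)]
x = rand_vec()

perm = [2, 0, 3, 1]
y_original = dense_moe(x, gate, experts)
# The same permutation must be applied to the gate and to the experts.
y_permuted = dense_moe(x, permute(gate, perm), permute(experts, perm))
# Permuting only the experts generally changes the function.
y_mismatched = dense_moe(x, gate, permute(experts, perm))
```

The two outputs `y_original` and `y_permuted` agree up to floating-point summation order, while `y_mismatched` generally differs, which is why the gate and the experts have to be relabeled jointly.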

Figures (96)

  • Figure 1: LMC curves for ViT (subplots 1-3) and GPT-2 (subplot 4) with a 4-expert MoE replacement at the first Transformer layer, on CIFAR-100, ImageNet21k$\rightarrow$CIFAR-100, ImageNet-1k, and One Billion Word datasets, respectively. Plots show consistent low-loss linear interpolation paths between fine-tuned models, indicating strong linear mode connectivity.
  • Figure 2: Loss curves for 12-layer ViT-MoE models with a 4-expert MoE replacement in either the first layer (subplots 1, 3) or the last layer (subplots 2, 4), on CIFAR-10 and CIFAR-100. The curves compare two Expert Order Matching methods across 24 permutations, with Weight Matching applied post-reordering to all permutations. Corresponding accuracy metrics are presented in Figure \ref{fig:rank_acc}.
  • Figure 3: Performance degradation in ViT-Base on ImageNet due to FFN reinitialization at different layers.
  • Figure 4: Effect of FFN reinitialization on GPT-2 perplexity across layers on WikiText103.
  • Figure 5: Linear Mode Connectivity for ViT-MoE on MNIST with 1 layer and 2 experts.
  • ...and 91 more figures
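The 24 permutations mentioned in the Figure 2 caption are consistent with the 4! orderings of four experts. As a hedged sketch of what searching over expert orderings could look like, the code below enumerates every ordering and keeps the one with the smallest squared distance between flattened expert weights. The distance criterion, the flattened-vector representation, and the function names are assumptions for illustration and are not taken from the paper's Expert Order Matching methods.

```python
from itertools import permutations


def squared_distance(u, v):
    """Squared Euclidean distance between two flat weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_expert_order(experts_a, experts_b):
    """Brute-force search over all expert orderings.

    Returns (perm, cost), where perm[i] is the index of the expert in
    model B that is aligned to expert i of model A. Feasible only for a
    small number of experts, since the search visits n! orderings.
    """
    n = len(experts_a)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(
            squared_distance(experts_a[i], experts_b[perm[i]]) for i in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Hypothetical flattened expert weights for model A.
experts_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
# Model B holds the same experts in a shuffled order: slot k holds A's expert shuffle[k].
shuffle = [2, 0, 3, 1]
experts_b = [experts_a[j] for j in shuffle]

perm, cost = best_expert_order(experts_a, experts_b)
num_orderings = len(list(permutations(range(len(experts_a)))))
```

For larger expert counts the factorial search becomes impractical, and a linear assignment solver would be the usual substitute for this kind of matching.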

Theorems & Definitions (38)

  • Proposition 3.1: Weight space invariance of Mixture-of-Experts
  • Proposition 3.2: Weight space invariance of Sparse Mixture-of-Experts
  • Remark 3.3
  • Theorem 4.1: Functional equivalence in Mixture-of-Experts with Dense Gating
  • Theorem 4.2: Functional equivalence in Mixture-of-Experts with Sparse Gating
  • Remark 4.3
  • Proposition A.1
  • proof
  • Proposition A.2
  • proof
  • ...and 28 more