SD-MoE: Spectral Decomposition for Effective Expert Specialization

Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang

TL;DR

This work investigates why Mixture-of-Experts (MoE) often fails to realize true expert specialization in large language models by revealing overlapping spectral structure in both parameters and gradients, as well as gating biases toward shared features. It introduces Spectral-Decoupled MoE (SD-MoE), which spectrally decomposes each expert’s parameters into a shared low-rank component and an expert-specific tail, and similarly decomposes gradients to update shared and unique parts independently. Empirical results on Qwen and DeepSeek MoEs show about $3\%$ average downstream gains, roughly $30\%$ faster training, and a reduction of inter-expert spectral similarity to below $0.1$, while tolerating much larger learning rates (up to $4\times$). SD-MoE incurs only about $5\%$ training overhead and remains broadly compatible with existing MoE architectures, offering a practical route to scalable and specialized expert utilization.
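
The split into a shared low-rank component and an expert-specific tail can be illustrated with a small NumPy sketch. This is not the paper's procedure: estimating the shared part from the truncated SVD of the mean expert matrix is an assumption made here for illustration, and `split_shared_and_tail`, `rank`, and the toy matrices are hypothetical names and data.

```python
import numpy as np

def split_shared_and_tail(expert_weights, rank):
    """Split expert matrices into one shared low-rank part plus per-expert tails.

    Assumption (not from the paper): the shared part is the rank-`rank`
    truncated SVD of the mean expert matrix.
    """
    stacked = np.stack(expert_weights)           # (n_experts, d_out, d_in)
    mean_w = stacked.mean(axis=0)                # (d_out, d_in)
    u, s, vt = np.linalg.svd(mean_w, full_matrices=False)
    # Keep only the leading `rank` singular triplets as the shared component.
    shared = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Each tail is whatever the shared component does not explain.
    tails = stacked - shared
    return shared, tails

# Toy experts: a common rank-2 matrix plus small expert-specific noise.
rng = np.random.default_rng(0)
common = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 12))
experts = [common + 0.1 * rng.normal(size=(16, 12)) for _ in range(4)]

shared, tails = split_shared_and_tail(experts, rank=2)
```

By construction, `shared + tails[i]` reproduces expert `i` exactly, so the split changes how the parameters are stored and updated, not the function they compute at the moment of decomposition.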

Abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting effective capacity and model performance. In this work, we analyze the parameter and gradient spaces from a spectral perspective and find that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by the ubiquitous low-rank structure of human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurs minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.
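
One way to picture decomposing gradients in the spectral space is to project each expert's gradient onto a shared subspace and treat the remainder as the expert-specific part. The sketch below is an assumption made for illustration, not the authors' update rule: the orthonormal `basis`, the averaging of the in-subspace parts, and the name `split_gradients` are all hypothetical.

```python
import numpy as np

def split_gradients(expert_grads, basis):
    """Split per-expert gradients into a shared part and expert-specific tails.

    `basis` has orthonormal columns spanning the assumed shared subspace.
    Assumption (not from the paper): the shared update is the mean of the
    in-subspace gradient components across experts.
    """
    grads = np.stack(expert_grads)               # (n_experts, d_out, d_in)
    # Orthogonal projection onto span(basis), broadcast over experts.
    in_subspace = basis @ (basis.T @ grads)      # (n_experts, d_out, d_in)
    shared_grad = in_subspace.mean(axis=0)       # one update for the shared part
    tail_grads = grads - in_subspace             # one update per expert
    return shared_grad, tail_grads

rng = np.random.default_rng(0)
# Hypothetical shared subspace: 2 orthonormal directions in a 16-dim output space.
basis, _ = np.linalg.qr(rng.normal(size=(16, 2)))
expert_grads = [rng.normal(size=(16, 12)) for _ in range(4)]

shared_grad, tail_grads = split_gradients(expert_grads, basis)
```

Because the tail gradients are orthogonal to the shared subspace, an update applied to an expert's tail does not move it along the shared directions, which is the sense in which the two parts can be updated independently in this sketch.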

Paper Structure (25 sections, 18 equations, 24 figures, 6 tables, 1 algorithm)

This paper contains 25 sections, 18 equations, 24 figures, 6 tables, 1 algorithm.

Figures (24)

  • Figure 1: (a) The top 1% spectral subspace in the DeepSeek model is highly aligned (similarity 0.8) and carries $>$30% of the energy \cite{ethayarajh2019contextual, puccetti2022outlier, cao2025metis}, while the remaining 99% tail is weakly aligned ($\sim$0.1). (b--d) Pairwise spectral similarity of expert parameters for (b) Qwen1.5-MoE \cite{qwen_moe}, (c) DeepSeek-V2-Light \cite{dai2024deepseekmoe}, and (d) SD-MoE. Existing MoE models exhibit strong overlap in the top 1% spectral subspace (avg. 0.7; some $>$0.9), whereas SD-MoE reduces this similarity to $\sim$0.1. Supplementary results for more models and layers are in Appendix \ref{append:more_results}, Figures \ref{fig:append.all_expert_sigvals}--\ref{fig:append.all_expert_prin_sim_ds}.
  • Figure 2: Analysis of Qwen1.5-MoE-A2.7B. (a) Pair-wise principal similarity of the dominant low-rank subspace from the gradient matrices of all experts. High values indicate near-identical spectral directions across experts, suggesting gradient similarity in their low-rank subspace. (b) The row vectors of the gating matrix exhibit high alignment with the leading singular directions of the expert weight matrices, indicating that the gating mechanism is dominated by common information.
  • Figure 3: (a) Activation ratio of each singular direction, defined as the fraction of tokens whose projections exceed the activation threshold. Leading singular directions are activated by nearly all tokens, indicating that they encode features consistently present across tokens and domains. (b) After shuffling the input tokens, the activation projections onto the top $1\%$ singular directions change more than those onto the tail $99\%$, indicating that the dominant subspace is highly sensitive to syntactic structure.
  • Figure 4: Gradient spectral analysis of experts in Qwen1.5-MoE-A2.7B. Pairwise principal similarity of (a) the dominant low-rank (1%) gradient subspace and (b) the remaining long-tail subspace across all experts, including the common expert. High similarity in the low-rank subspace indicates shared gradient directions across experts, while tail similarity is substantially weaker. Additional results across models and layers are provided in Appendix \ref{append:more_results} (Figures \ref{fig:append.all_expert_grad_prin_sim} and \ref{fig:append.all_expert_grad_tail_sim}).
  • Figure 5: Pairwise principal subspace similarity of the top $1\%$ activation spectral directions across data samples, confirming the presence of a shared low-rank input subspace. Additional results across models and layers are provided in Appendix \ref{append:more_results} (Figure \ref{fig:append.all_data_act_alignment}).
  • ...and 19 more figures
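
The pairwise similarities reported in the captions above compare the leading singular subspaces of two matrices. The captions do not give the exact metric, so the sketch below assumes one common choice: the mean squared cosine of the principal angles between the top-`k` left singular subspaces. The function names and toy matrices are hypothetical.

```python
import numpy as np

def top_subspace(weight, k):
    """Orthonormal basis of the top-k left singular directions of `weight`."""
    u, _, _ = np.linalg.svd(weight, full_matrices=False)
    return u[:, :k]

def subspace_similarity(w_a, w_b, k):
    """Mean squared cosine of the principal angles between two top-k subspaces.

    Assumed metric (the captions do not define one): 1.0 means identical
    subspaces, 0.0 means mutually orthogonal subspaces.
    """
    u_a, u_b = top_subspace(w_a, k), top_subspace(w_b, k)
    # Singular values of U_a^T U_b are the cosines of the principal angles.
    cosines = np.linalg.svd(u_a.T @ u_b, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Toy experts sharing one strong direction plus independent noise.
rng = np.random.default_rng(0)
common = 5.0 * np.outer(rng.normal(size=64), rng.normal(size=32))
expert_a = common + rng.normal(size=(64, 32))
expert_b = common + rng.normal(size=(64, 32))

sim_same = subspace_similarity(expert_a, expert_a, k=1)
sim_ab = subspace_similarity(expert_a, expert_b, k=1)
```

In this toy setup the shared component dominates both matrices, so the top-1 subspaces nearly coincide. This mirrors the overlap the captions describe for existing MoE experts, though the numbers here say nothing about real models.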