Advancing Expert Specialization for Better MoE

Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang

TL;DR

This work tackles the conflict between expert specialization and routing uniformity in MoE models caused by auxiliary load-balancing losses. It introduces two gradient-consistent objectives, an orthogonality loss $\mathcal{L}_{o}$ and a variance loss $\mathcal{L}_{v}$, which promote distinct expert representations and diversified routing without sacrificing load balancing. The authors provide a theoretical compatibility framework and empirically validate improvements across 11 benchmarks and multiple MoE architectures, achieving up to 23.79% relative gains with no architectural changes. The approach demonstrates that loss-level innovations can unlock MoE efficiency and specialization, enabling better downstream performance in domain-specific settings while maintaining computational efficiency.

Abstract

Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
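
To make the two objectives named in the abstract concrete, the following is a minimal sketch, not the authors' formulation. It assumes the variance loss is the negated variance of each expert's routing score across tokens, and the orthogonality loss is the mean squared cosine similarity between pairs of expert output vectors. The function names `variance_loss` and `orthogonality_loss` and both formulas are illustrative assumptions; the summary above does not state the exact definitions.

```python
import math

def routing_variance(scores):
    """Average, over experts, of the variance of that expert's score across tokens.

    `scores` is a list of rows, one row per token, one column per expert.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    total = 0.0
    for j in range(n_experts):
        column = [scores[i][j] for i in range(n_tokens)]
        mean = sum(column) / n_tokens
        total += sum((s - mean) ** 2 for s in column) / n_tokens
    return total / n_experts

def variance_loss(scores):
    # Assumption: negate the variance so that minimizing the loss
    # pushes routing scores away from a uniform assignment.
    return -routing_variance(scores)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero

def orthogonality_loss(expert_outputs):
    """Mean squared cosine similarity over all distinct expert pairs (assumed form)."""
    n = len(expert_outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(expert_outputs[i], expert_outputs[j]) ** 2 for i, j in pairs) / len(pairs)

# Uniform routing has zero variance; decisive but balanced routing does not.
uniform = [[0.5, 0.5], [0.5, 0.5]]
decisive = [[1.0, 0.0], [0.0, 1.0]]
print(variance_loss(uniform), variance_loss(decisive))
```

In this toy case both score matrices give each expert the same average score, so the load is balanced in both, yet only the decisive one has non-zero variance. This illustrates, under the stated assumptions, how a variance term can coexist with a load-balancing term.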

Paper Structure

This paper contains 45 sections, 6 theorems, 46 equations, 5 figures, 8 tables.

Key Result

Lemma 1

Let $\mathcal{S} \in \mathcal{R}^{N \times n}$ be a matrix that satisfies the following conditions: each row sums to 1, and each row contains $k$ non-zero elements and $n-k$ zero elements. Then there always exists a state in which the following two objectives are simultaneously optimized: 1. The sum of the
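
The conditions on $\mathcal{S}$ match what a top-$k$ router produces when it renormalizes the selected scores. The sketch below builds such a matrix and checks the two stated conditions. It assumes a softmax over the top-$k$ logits, which is one common choice and is not stated in the lemma; `topk_routing_row` and `satisfies_lemma_conditions` are hypothetical helper names.

```python
import math

def topk_routing_row(logits, k):
    """Keep the k largest logits, softmax over them, and zero out the rest."""
    top = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    peak = max(logits[j] for j in top)  # subtract the max for numerical stability
    exps = {j: math.exp(logits[j] - peak) for j in top}
    z = sum(exps.values())
    return [exps[j] / z if j in exps else 0.0 for j in range(len(logits))]

def satisfies_lemma_conditions(S, k, tol=1e-9):
    """Each row sums to 1 and has exactly k non-zero entries."""
    return all(
        abs(sum(row) - 1.0) <= tol and sum(1 for s in row if s != 0.0) == k
        for row in S
    )

# N = 2 tokens, n = 4 experts, k = 2 active experts per token.
logits = [[2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 3.0, 1.0]]
S = [topk_routing_row(row, k=2) for row in logits]
print(satisfies_lemma_conditions(S, k=2))
```

For logits with very large gaps, a selected score could underflow to zero in floating point, so the exact non-zero count is a property of the idealized matrix, not a guarantee of this code.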

Figures (5)

  • Figure 1: Two core effects of our method. Left (Routing Diversification): Left-Bottom: after training, scores show higher discrimination than the untrained model. Right-Top: expert load variance decreases after training. Right-Bottom: during training, variance increases markedly, yielding more decisive token-to-expert assignments. Right (Expert Specialization): Cluster Separation: clearer per-expert token clusters emerge after training, evidencing specialization. Overlap: the baseline exhibits heavy token-assignment overlap across experts, which our method substantially reduces.
  • Figure 2: Variation of Load Balancing. The figure illustrates the variation of load balancing during training across three distinct models for different methods. Method represents the combination of $\mathcal{L}_{\text{aux}}$, $\mathcal{L}_{\text{o}}$, and $\mathcal{L}_{\text{v}}$; Step denotes the number of training steps; $MaxVio_{\text{global}} \downarrow$ serves as the metric for load balancing; and RMSE is the metric for measuring the similarity between two curves.
  • Figure 3: Behaviors of Experts and Routing. The figure demonstrates the behavioral states of experts and routing across different methods. The first two subplots, Silhouette Coefficient and Expert Overlap, measure the degree of expert orthogonality, while the last subplot, Routing Variance, evaluates the diversity of routing outputs.
  • Figure 4: Ablation Experiments. The figure illustrates the performance differences of different ablation method combinations across three models on various benchmarks. The vertices on the circles represent the corresponding benchmark names, with the same type connected by the same color. The numbers inside the circles denote the accuracy represented by each circle.
  • Figure 5: Selected Images (4×3)
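
The two metrics named in the Figure 2 caption can be sketched as follows. This assumes that $MaxVio$ follows the maximal-violation definition common in load-balancing work, (max expert load − mean load) / mean load, and that the "global" variant is computed from loads accumulated over the whole evaluation set. Neither detail is confirmed by the caption, and `max_violation` and `rmse` are illustrative names.

```python
import math

def max_violation(expert_loads):
    """(max load - mean load) / mean load; 0 means perfectly balanced.

    Assumes a positive total load, i.e. at least one token was routed.
    """
    mean_load = sum(expert_loads) / len(expert_loads)
    return (max(expert_loads) - mean_load) / mean_load

def rmse(curve_a, curve_b):
    """Root-mean-square error between two equally long curves."""
    squared = [(a - b) ** 2 for a, b in zip(curve_a, curve_b)]
    return math.sqrt(sum(squared) / len(squared))

# Token counts per expert, accumulated over an evaluation set.
balanced = [10, 10, 10, 10]
collapsed = [40, 0, 0, 0]
print(max_violation(balanced), max_violation(collapsed))
```

Under this definition, lower values indicate better balance, which is consistent with the $\downarrow$ marker in the caption. A small RMSE between two methods' $MaxVio$ curves would mean their load-balancing behavior during training is similar.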

Theorems & Definitions (8)

  • Lemma 1
  • Lemma 2
  • Lemma 1
  • Lemma 2
  • Lemma 1
  • proof C.1
  • Lemma 2
  • proof C.2