Expert Divergence Learning for MoE-based Language Models

Jiaang Li; Haibin Chen; Langming Liu; Yujin Yuan; Yadao Wang; Yizhen Zhang; Chengting Yu; Xin Tong; Weidong Zhang; Shilei Liu; Wenbo Su; Bo Zheng

TL;DR

Expert Divergence Learning is a pre-training strategy that explicitly encourages functional specialization among MoE experts, mitigating expert homogenization with negligible computational overhead during training.

Abstract

The Mixture-of-Experts (MoE) architecture is a powerful technique for scaling language models, yet it often suffers from expert homogenization, where experts learn redundant functionalities, thereby limiting MoE's full potential. To address this, we introduce Expert Divergence Learning, a novel pre-training strategy that explicitly encourages functional specialization among experts. Our method incorporates a label-driven auxiliary loss that leverages domain labels inherent in pre-training corpora to maximize the Jensen-Shannon Divergence between the expert routing distributions of different data domains. This optimization objective guides the model to develop diverged routing policies for varied domains and closer routing policies for the same domain, which leads to emergent and organized expert specialization. We validate our approach by pre-training MoE models of up to 15 billion parameters from scratch. Experimental results demonstrate that models trained with Expert Divergence Learning not only achieve a lower language modeling loss but also exhibit significant performance improvements across a diverse range of downstream benchmarks. Further analysis confirms that our method effectively mitigates expert homogenization and brings greater functional specialization, all with negligible computational overhead during training.
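
To make the auxiliary objective described in the abstract concrete, here is a minimal sketch in plain Python. It averages the routing distributions of the tokens in each domain and returns the negative mean pairwise Jensen-Shannon Divergence between those domain averages, so that minimizing the value pushes domains toward different experts. The function names, the pairwise averaging, and the way the coefficient `beta` scales the term are assumptions made for illustration; the paper's exact formulation is not reproduced in this summary.

```python
import math
from collections import defaultdict
from itertools import combinations


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mean_distribution(rows):
    """Element-wise average of equally long probability vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (at most ln 2)."""
    m = mean_distribution([p, q])
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))


def expert_divergence_loss(routing_probs, domain_labels, beta=1.0):
    """Hypothetical auxiliary loss: negative mean pairwise JSD between
    the domain-averaged routing distributions of one batch."""
    by_domain = defaultdict(list)
    for probs, label in zip(routing_probs, domain_labels):
        by_domain[label].append(probs)
    domain_means = [mean_distribution(rows) for rows in by_domain.values()]
    pairs = list(combinations(domain_means, 2))
    if not pairs:  # a single-domain batch carries no inter-domain signal
        return 0.0
    avg_jsd = sum(jsd(p, q) for p, q in pairs) / len(pairs)
    return -beta * avg_jsd


# Toy batch: four tokens routed over two experts.
routing = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
loss_separated = expert_divergence_loss(routing, ["code", "code", "math", "math"])
loss_mixed = expert_divergence_loss(routing, ["code", "math", "code", "math"])
```

In this toy batch, the labeling where each domain uses its own expert yields a lower (more negative) loss than the labeling where both domains average to the same routing distribution, which is the behavior the abstract describes.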

Paper Structure (46 sections, 2 theorems, 20 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Related Works
MoE-based Language Models.
Expert Specialization.
Methodology
Background: MoE Structure and Standard Training Objectives
Expert Divergence Learning
Theoretical Motivation: Finer-Grained Diversity Allocation
Experiments
Experimental Setup
Main Results
Analysis on Expert Specialization
Efficiency Analysis
Conclusion
Derivations for Divergence Decomposition
...and 31 more sections

Key Result

Proposition 1

The total routing divergence in a batch is the sum of its inter-domain and intra-domain components: $D_{\text{total}} = D_{\text{inter}} + D_{\text{intra}}$.
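
The decomposition can be checked numerically with a short sketch. It assumes each term is a generalized Jensen-Shannon divergence, that is, the entropy of an averaged routing distribution minus the average entropy of its members, with domains weighted by their token counts. These definitions and the function names are assumptions for illustration; the paper's own definitions of $D_{\text{total}}$, $D_{\text{inter}}$, and $D_{\text{intra}}$ appear in its derivation section and are not reproduced here.

```python
import math
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mean_distribution(rows):
    """Element-wise average of equally long probability vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def divergence_components(routing_probs, domain_labels):
    """Return (total, inter, intra) under the generalized-JSD assumption."""
    n_tokens = len(routing_probs)
    by_domain = defaultdict(list)
    for probs, label in zip(routing_probs, domain_labels):
        by_domain[label].append(probs)

    # Total: entropy of the batch-average routing minus mean token entropy.
    global_entropy = entropy(mean_distribution(routing_probs))
    total = global_entropy - sum(entropy(p) for p in routing_probs) / n_tokens

    weighted_domain_entropy = 0.0
    intra = 0.0
    for rows in by_domain.values():
        weight = len(rows) / n_tokens  # share of the batch in this domain
        domain_entropy = entropy(mean_distribution(rows))
        weighted_domain_entropy += weight * domain_entropy
        # Intra: spread of token routings around their own domain average.
        intra += weight * (domain_entropy - sum(entropy(p) for p in rows) / len(rows))

    # Inter: spread of the domain averages around the batch average.
    inter = global_entropy - weighted_domain_entropy
    return total, inter, intra


# Toy batch: five tokens over three experts, two domains of unequal size.
routing = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1],
           [0.1, 0.3, 0.6], [0.2, 0.2, 0.6], [0.3, 0.3, 0.4]]
labels = ["web", "web", "code", "code", "code"]
total, inter, intra = divergence_components(routing, labels)
```

Under these assumed definitions the identity holds because the weighted domain-entropy term appears with opposite signs in `inter` and `intra` and cancels in the sum, so `total` equals `inter + intra` up to floating-point rounding.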

Figures (9)

Figure 1: A conceptual comparison of the standard Load-Balancing Loss and our proposed Expert-Divergence Loss. (a) For a batch of tokens partitioned by domain, an MoE layer generates expert routing distributions. (b) The standard Load-Balancing Loss ($\mathcal{L}_{LB}$) promotes uniformity by operating on the global average of expert distributions. This can lead to homogenization, as experts are trained on indistinct data mixtures. (c) In contrast, our Expert-Divergence Loss ($\mathcal{L}_{ED}$) provides a targeted signal by maximizing the divergence between domain-specific average distributions. This guides distinct experts to train on differentiated data subsets and develop functional specialization.
Figure 2: Comparative analysis of the primary language modeling loss $\mathcal{L}_{LM}$ for the 3B-A0.3B models. This figure contrasts the training performance of the baseline MoE with models trained using our Expert Divergence Loss under various configurations, including different domain schemes (3-class and 49-class) and divergence coefficients ($\beta$).
Figure 3: Increase in perplexity ($\Delta PPL$) after randomly permuting router weights for each layer of the pre-trained 15B-A1.5B models. Higher values indicate greater expert specialization.
Figure 4: Expert activation heatmaps of different domains for representative layers (0, 4, 14) of the 15B-A1.5B models. Each row shows the average expert activation probabilities for a given validation domain. Darker colors indicate more frequent activation.
Figure 5: Expert activation heatmap of all layers in 15B-A1.5B models.
...and 4 more figures

Theorems & Definitions (2)

Proposition 1: Divergence Decomposition
Proposition 2: Synergistic Optimization
