Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization

Rizhen Hu, Yuan Cao, Boao Kong, Mou Sun, Kun Yuan

TL;DR

This work tackles expert overlap and routing ambiguity in sparse Mixture-of-Experts Transformers by introducing two plug-and-play losses that require no architectural changes. The intra-layer specialization loss penalizes the activation similarity of co-activated experts to promote distinct, complementary specialization, while the cross-layer coupling loss reinforces coherent expert paths across depth by maximizing joint routing probabilities between adjacent layers. Together, they form a self-reinforcing loop in which more decisive routing amplifies specialization and vice versa, while remaining compatible with load balancing. Empirically, the losses yield consistent perplexity improvements, stronger expert discrimination, lower routing entropy, and faster inference due to more stable pathways, across pre-training, fine-tuning, and zero-shot benchmarks, in both vanilla and DeepSeekMoE settings. The results indicate that loss-centric specialization can rival architectural modifications as a scalable, drop-in improvement for MoE-based Transformers.

Abstract

Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap (redundant representations across experts) and routing ambiguity, which leave model capacity severely underutilized. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
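
The intra-layer specialization loss described above can be illustrated with a minimal NumPy sketch. This is not the paper's Megatron-LM module: the function name `specialization_loss`, the `(tokens, k, d)` input layout, and the choice to average over all expert pairs and tokens are assumptions made for illustration only.

```python
import numpy as np

def specialization_loss(acts, eps=1e-8):
    """Mean pairwise cosine similarity between co-activated experts.

    acts: array of shape (tokens, k, d). For each token it holds the
          intermediate (SwiGLU) activation of each of its k activated
          experts. Layout and averaging are illustrative assumptions.
    """
    acts = np.asarray(acts, dtype=np.float64)
    _, k, _ = acts.shape
    if k < 2:
        return 0.0  # a single activated expert has no pair to compare
    # Normalize each expert activation to unit length (guarding zeros).
    norms = np.linalg.norm(acts, axis=-1, keepdims=True)
    unit = acts / np.maximum(norms, eps)
    # Batched Gram matrix of cosine similarities, shape (tokens, k, k).
    sim = unit @ unit.transpose(0, 2, 1)
    # Keep each unordered expert pair once (strict upper triangle).
    rows, cols = np.triu_indices(k, k=1)
    pair_sims = sim[:, rows, cols]
    return float(pair_sims.mean())

# Two tokens, two activated experts each, 2-dimensional activations.
example = np.array([
    [[1.0, 0.0], [0.0, 3.0]],   # orthogonal activations
    [[1.0, 1.0], [2.0, 2.0]],   # parallel activations
])
print(specialization_loss(example))
```

Minimizing a quantity of this kind pushes the activations of experts that fire on the same token away from each other, which is the behaviour the abstract attributes to the specialization loss. In training it would presumably be added to the task and load-balancing losses with a weighting coefficient, which is not specified in this excerpt.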

Paper Structure (43 sections, 10 theorems, 76 equations, 9 figures, 14 tables)

Key Result

Proposition 4.1

For any two activated experts $e, \nu \in \mathbb{A}_i^{(\ell)}$, the cosine similarity between the gradients of the total loss $\mathcal{L}$ with respect to their down-projection matrices is expressed in terms of $z_i^{(\ell,e)}$ and $z_i^{(\ell,\nu)}$, the corresponding intermediate activations. (The proof is provided in the appendix, statement 1.)
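
The displayed relation of Proposition 4.1 is not reproduced in this summary. As a hedged sketch of how such an activation-gradient alignment can arise, consider the contribution of a single token $i$ under a standard MoE output form. The symbols $y_i^{(\ell)}$, $g_{i,e}$, $W_{\mathrm{down}}^{(\ell,e)}$, $\delta_i$, and $G_e$ are notation assumed here, not taken from the source, and the paper's exact statement may differ.

```latex
% Assumptions (not from the source): the layer output for token i is
%   y_i^{(\ell)} = \sum_{e \in \mathbb{A}_i^{(\ell)}} g_{i,e} W_{\mathrm{down}}^{(\ell,e)} z_i^{(\ell,e)},
% with gates g_{i,e} > 0, the loss depends on W_{\mathrm{down}}^{(\ell,e)} only
% through y_i^{(\ell)}, and \delta_i, z_i^{(\ell,e)}, z_i^{(\ell,\nu)} are nonzero.
\begin{align*}
G_e &:= g_{i,e}\,\delta_i\,\bigl(z_i^{(\ell,e)}\bigr)^{\top},
  \qquad \delta_i := \frac{\partial \mathcal{L}}{\partial y_i^{(\ell)}}, \\
\langle G_e, G_\nu \rangle_F
  &= g_{i,e}\,g_{i,\nu}\,\|\delta_i\|_2^2\,
     \bigl\langle z_i^{(\ell,e)}, z_i^{(\ell,\nu)} \bigr\rangle, \\
\|G_e\|_F &= g_{i,e}\,\|\delta_i\|_2\,\bigl\|z_i^{(\ell,e)}\bigr\|_2, \\
\cos\bigl(G_e, G_\nu\bigr)
  &= \frac{\bigl\langle z_i^{(\ell,e)}, z_i^{(\ell,\nu)} \bigr\rangle}
          {\bigl\|z_i^{(\ell,e)}\bigr\|_2\,\bigl\|z_i^{(\ell,\nu)}\bigr\|_2}
   = \cos\bigl(z_i^{(\ell,e)}, z_i^{(\ell,\nu)}\bigr).
\end{align*}
```

Under these assumptions, the per-token gradients of two co-activated experts are exactly as aligned as their intermediate activations, which is consistent with penalizing activation similarity as a way to decorrelate expert updates.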

Figures (9)

  • Figure 1: The perplexity for training a 1.1B model with different regularization. Setup is in Table \ref{tab:moe-config}.
  • Figure 2: Conditional activation probabilities between experts in layers 7 and 8 for a 0.4B MoE model. Top: training with only load-balance regularization. Bottom: training with both load-balance and coupling regularization.
  • Figure 3: Training dynamics on the 0.4B MoE model. Left: cross-layer coupling loss $\mathcal{R}_{\mathrm{cp}}$ when training with $\mathcal{L}_{\mathrm{lb,cp}}$ vs. $\mathcal{L}_{\mathrm{lb,cp,sp}}$; adding $\mathcal{R}_{\mathrm{sp}}$ consistently makes $\mathcal{R}_{\mathrm{cp}}$ more negative (stronger coupling). Right: intra-layer specialization loss $\mathcal{R}_{\mathrm{sp}}$ when training with $\mathcal{L}_{\mathrm{lb,sp}}$ vs. $\mathcal{L}_{\mathrm{lb,sp,cp}}$; adding $\mathcal{R}_{\mathrm{cp}}$ consistently reduces $\mathcal{R}_{\mathrm{sp}}$ (stronger specialization).
  • Figure 4: The self-reinforcing cycle illustrating how expert specialization and routing decisiveness amplify one another.
  • Figure 5: Scalability performance with varying number of activated experts ($N$) on medium-sized models.
  • ...and 4 more figures
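
The conditional activation probabilities shown in Figure 2 can be estimated from routing decisions alone. The following is a minimal sketch, assuming the top-$k$ expert indices chosen for each token in two adjacent layers are available as integer arrays; the function name `conditional_activation` and its arguments are hypothetical and not part of the paper's code.

```python
import numpy as np

def conditional_activation(topk_a, topk_b, num_experts):
    """Estimate P(expert j active in layer B | expert i active in layer A).

    topk_a, topk_b: integer arrays of shape (tokens, k) with the expert
    indices selected for each token in two adjacent layers.
    Returns an array of shape (num_experts, num_experts).
    """
    topk_a = np.asarray(topk_a)
    topk_b = np.asarray(topk_b)
    n_tokens = topk_a.shape[0]
    # Binary activation masks, shape (tokens, num_experts).
    mask_a = np.zeros((n_tokens, num_experts))
    mask_b = np.zeros((n_tokens, num_experts))
    rows = np.arange(n_tokens)[:, None]
    mask_a[rows, topk_a] = 1.0
    mask_b[rows, topk_b] = 1.0
    # joint[i, j] counts tokens routed to expert i in A and expert j in B.
    joint = mask_a.T @ mask_b
    count_a = mask_a.sum(axis=0)[:, None]
    # Rows of experts that were never activated in layer A stay at zero.
    return np.divide(joint, count_a,
                     out=np.zeros_like(joint), where=count_a > 0)

# Three tokens, top-1 routing, two experts per layer.
probs = conditional_activation([[0], [0], [1]], [[1], [1], [0]], num_experts=2)
print(probs)
```

A matrix whose rows concentrate mass on a few entries indicates stable expert pathways between the two layers, while near-uniform rows indicate that the routing choices in adjacent layers are largely independent.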

Theorems & Definitions (19)

  • Proposition 4.1: Activation-gradient alignment
  • Proposition 5.1
  • Proposition 2.1: Proposition \ref{prop:activation-similarity}
  • proof
  • Proposition 2.2: Proposition \ref{lemma2}
  • proof
  • Theorem 3.1: Weak specialization implies high-probability decisive routing
  • proof
  • Remark 3.2: Why is the oracle gap typically small during training?
  • Corollary 3.3: Entropy bound as a consequence of decisive routing
  • ...and 9 more