Towards an empirical understanding of MoE design choices

Dongyang Fan; Bettina Messmer; Martin Jaggi

TL;DR

The paper empirically analyzes Mixture of Experts (MoE) design choices by ablating the routing unit (token vs. sequence), Top-1 vs. Top-2 gating, and the routing scope, using GPT-2 small with MoE layers inserted in the FFN. It finds that the routing strategy largely drives validation performance: token-level Layer-wise Top-2 routing surpasses dense baselines with the same total parameter count, and sequence-level Layer-wise Top-2 routing matches dense baselines with an equivalent number of active parameters. Learned routers offer little advantage over frozen or random routing, and language-guided routing is not superior to layer-wise learned routing, which suggests that the router weights may matter less than routing path diversity and contextual routing. The work also shows that sequence-level routing can induce weak topic specialization, whereas token-level routing tends toward syntactic specialization, which gives practical guidance for MoE design in multilingual and cross-domain models. Overall, the findings suggest that simpler routing configurations can remain competitive in practice.

Abstract

In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at token and sequence levels. We also present empirical evidence showing comparable performance between a learned router and a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further reveals that Sequence-level routing can result in topic-specific weak expert specialization, in contrast to syntax specialization observed with Token-level routing.
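
The distinction between token-level and sequence-level routing, and the idea of a frozen, randomly initialized router, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear router, the mean pooling used for the sequence-level decision, the renormalization of the Top-k weights, and all names (`top_k_gate`, `route_tokens`, `route_sequence`, `W`) are assumptions made for illustration only.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Return (expert indices, weights) for the k largest router logits.

    Assumption: the selected weights are renormalized to sum to 1.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]


def router_logits(x, W):
    """Linear router: one logit per expert; W has shape [num_experts][d_model]."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]


def route_tokens(seq, W, k):
    """Token-level routing: every token picks its own experts."""
    return [top_k_gate(router_logits(tok, W), k) for tok in seq]


def route_sequence(seq, W, k):
    """Sequence-level routing: one decision shared by all tokens.

    Assumption: the sequence is summarized by the mean token representation.
    """
    d = len(seq[0])
    mean = [sum(tok[j] for tok in seq) / len(seq) for j in range(d)]
    return top_k_gate(router_logits(mean, W), k)


rng = random.Random(0)
d_model, num_experts, seq_len = 8, 4, 5

# Frozen, randomly initialized router: W is sampled once and never updated.
W = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
# Hypothetical token representations for one sequence.
seq = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]

token_routes = route_tokens(seq, W, k=2)                # one (experts, weights) pair per token
seq_experts, seq_weights = route_sequence(seq, W, k=2)  # one pair for the whole sequence
```

In this sketch a learned router would differ only in that `W` is updated during training; setting `k=1` gives Top-1 gating.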

Paper Structure (19 sections, 2 equations, 8 figures, 3 tables)

Introduction
Related Works
Design Choices
Expert Specialization
Experiments
Performance Impact of Design Choices
Does Expert collapse hurt?
Do more experts and more activated experts always help?
Does expert specialization exist?
What does a router learn?
Frozen routing.
Language guided routing versus learned Global routing.
The existence of weak experts?
Conclusion
Appendix
...and 4 more sections

Figures (8)

Figure 1: The frequency with which each expert is activated at every training iteration. Left: pretraining without load balancing loss, resulting in a validation perplexity of 10.674. Right: with load balancing loss ($\lambda=0.01$), resulting in a validation perplexity of 10.667.
Figure 2: Layer-wise expert assignment results. From left to right: (1) pre-trained on OpenWebText dataset, evaluated on 6 categories of MMLU dataset; (2) pre-trained on Multilingual Wikipedia dataset, evaluated on 6 categories of MMLU dataset; (3) pre-trained on Multilingual Wikipedia dataset, evaluated on 4 languages from XNLI dataset and X-Stance Dataset.
Figure 3: Validation perplexity versus training iterations.
Figure 4: Expert assignment results when evaluating on MMLU dataset using different pre-trained models on OpenWebText dataset. From left to right: (1) learned routing; (2) frozen routing; (3) random routing.
Figure 5: Left: pre-trained on Multilingual Wikipedia dataset, evaluated on 6 categories of MMLU dataset; Right: pre-trained on Multilingual Wikipedia dataset, evaluated on 4 languages from XNLI dataset and X-Stance Dataset. Layer-wise Token-level Top2 routing is employed in pretraining.
...and 3 more figures
