Towards an empirical understanding of MoE design choices
Dongyang Fan, Bettina Messmer, Martin Jaggi
TL;DR
The paper empirically analyzes Mixture of Experts (MoE) design choices by ablating the routing unit (token vs. sequence), Top-1 vs. Top-2 gating, and routing scope on GPT-2 small with MoEs inserted in the FFN. It finds that the routing strategy largely drives validation performance: token-level Layer-wise Top-2 routing surpasses dense baselines with the same total parameter count, while sequence-level Layer-wise Top-2 routing matches dense baselines with an equivalent number of active parameters. Surprisingly, learned routers offer little advantage over frozen, randomly initialized routers, and language-guided routing does not outperform layer-wise learned routing, which suggests that the router weights may matter less than routing-path diversity and context-dependent routing. The work also shows that sequence-level routing can induce weak topic specialization, whereas token-level routing tends toward syntactic specialization, which offers guidance for MoE design in multilingual and cross-domain models. Overall, the findings suggest that practical MoE deployments can use simpler routing configurations and remain competitive, with sequence-level routing offering an advantage where topic-aware specialization is desired.
Abstract
In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at the token and sequence levels. We also present empirical evidence that a learned router and a frozen, randomly initialized router perform comparably, suggesting that learned routing may not be essential. Our study further reveals that Sequence-level routing can result in weak, topic-specific expert specialization, in contrast to the syntax specialization observed with Token-level routing.
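To make the contrast between Token-level and Sequence-level routing concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear router whose weights are randomly initialized and never updated, Top-k gating with a softmax over the selected experts only, and, for the Sequence-level case, a sequence summary computed as the mean of the token vectors. The function names, dimensions, and the mean-pooling choice are all illustrative assumptions.

```python
import math
import random


def make_frozen_router(num_experts, dim, seed=0):
    """Randomly initialized router weights that are never updated (frozen)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]


def top_k_gate(router, x, k=2):
    """Return [(expert_index, gate_weight), ...] for one input vector x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router]
    # Indices of the k largest logits.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the chosen experts (shifted by the max for stability).
    m = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def route_tokens(router, sequence, k=2):
    """Token-level routing: each token picks its own experts."""
    return [top_k_gate(router, tok, k) for tok in sequence]


def route_sequence(router, sequence, k=2):
    """Sequence-level routing: one decision shared by every token.

    Assumption: the sequence is summarized by the mean of its token vectors.
    """
    dim = len(sequence[0])
    mean = [sum(tok[d] for tok in sequence) / len(sequence) for d in range(dim)]
    gate = top_k_gate(router, mean, k)
    return [gate for _ in sequence]


if __name__ == "__main__":
    rng = random.Random(1)
    seq = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
    router = make_frozen_router(num_experts=4, dim=8)
    print("token-level:   ", [[i for i, _ in g] for g in route_tokens(router, seq)])
    print("sequence-level:", [[i for i, _ in g] for g in route_sequence(router, seq)])
```

In this sketch, Token-level routing can send neighboring tokens to different experts, while Sequence-level routing gives every token in a sequence the same expert pair. That difference is one plausible reason the two granularities could specialize differently, though the sketch itself does not demonstrate it.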
