Multilingual Routing in Mixture-of-Experts

Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng

TL;DR

This work probes how sparse MoE routing handles multilingual data, showing language-specific routing in the early and late decoder layers but language-universal routing in middle layers. By analyzing parallel multilingual data, the authors reveal a strong link between a model's performance in a language and how closely that language's tokens align with English routing in middle layers. They go beyond correlation with inference-time interventions that steer routers toward English-preferred experts, yielding consistent 1–2% multilingual gains across multiple models and languages. The findings highlight a modular division between language-specific and language-universal parameterization and motivate training-time strategies to enhance cross-lingual routing alignment for improved multilingual generalization.

Abstract

Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
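
The abstract describes the steering intervention only at a high level, so the following is a minimal sketch of one way such steering could look, not the authors' implementation. It assumes the router emits one logit per expert, that steering is an additive boost to the logits of promoted experts, and that only a chosen range of middle layers is touched. The names `route_top_k`, `steer_layer`, `boost`, and the toy layer range are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k, promoted=(), boost=0.0):
    """Return {expert_index: weight} for the top-k experts.

    `promoted` and `boost` are assumptions for illustration: promoted
    experts get `boost` added to their logit before top-k selection.
    """
    steered = [l + boost if i in promoted else l for i, l in enumerate(logits)]
    probs = softmax(steered)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.
    return {i: probs[i] / norm for i in top}


def steer_layer(layer_idx, logits, k, promoted_by_layer, middle_layers, boost):
    """Apply the boost only inside the (hypothetical) middle-layer range."""
    if layer_idx not in middle_layers:
        return route_top_k(logits, k)
    promoted = promoted_by_layer.get(layer_idx, ())
    return route_top_k(logits, k, promoted, boost)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.0]   # toy router logits for one token
    promoted_by_layer = {5: {3}}    # pretend expert 3 is English-preferred at layer 5
    middle = range(4, 8)            # toy middle-layer range
    print(steer_layer(0, logits, 2, promoted_by_layer, middle, boost=1.5))
    print(steer_layer(5, logits, 2, promoted_by_layer, middle, boost=1.5))
```

In this toy case the boost swaps expert 1 for expert 3 in the top-2 set at layer 5 and leaves the early layer untouched, which mirrors the reported pattern that only middle-layer interventions help.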

Paper Structure

This paper contains 26 sections, 8 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Visualization of the typical divergence in MoE routing weights across model layers between English and a high-, medium-, and low-resource language. There is consistently lower divergence in the middle layers, where experts are shared across languages. Languages the model does not understand (e.g. Bambara) fail to leverage similar experts as top languages. In this work, we also present a steering method that activates similar experts to English (red arrows) and results in improved multilingual generalization (e.g. an increase in MGSM-Bengali from $0.776$ to $0.824$).
  • Figure 2: Visualization with more languages of routing divergence from English across model layers based on Qwen3-30B-A3B, where the U-shape can be seen for all. Each line is colored by how well the model understands that language (Belebele accuracy), highlighting a strong correlation between the two. We label a few notable plotted languages, but provide the same graph (along with 3 more models) colored to better distinguish languages in Appendix \ref{divergenceplots}.
  • Figure 3: Routing Entropy per Layer for OLMoE.
  • Figure 4: Token routing consistency (within a sequence), across layers in Phi-3.5-MoE.
  • Figure 5: Plot of the Number of Identified Experts per Layer, with $\tau=0.3$ for Qwen3. The red vertical bars delimit the region in which we intervene.
  • ...and 9 more figures
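
The layer-wise curves described in Figures 1 to 3 can be illustrated with a small sketch. It assumes each layer is summarized, per language, by an average routing distribution over experts, and it uses Jensen-Shannon divergence as a stand-in because the exact divergence measure is not stated in this excerpt. All function names and the toy numbers are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a routing distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def layerwise_divergence(english, other):
    """One divergence value per layer, comparing a language to English."""
    return [js_divergence(p, q) for p, q in zip(english, other)]


if __name__ == "__main__":
    # Toy per-layer routing distributions over 4 experts (3 layers).
    english = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25], [0.1, 0.1, 0.1, 0.7]]
    other = [[0.1, 0.7, 0.1, 0.1], [0.30, 0.20, 0.25, 0.25], [0.1, 0.1, 0.7, 0.1]]
    for layer, d in enumerate(layerwise_divergence(english, other)):
        print(f"layer {layer}: JS divergence = {d:.4f}, "
              f"English routing entropy = {entropy(english[layer]):.4f}")
```

With these toy inputs the middle layer shows the lowest divergence from English, a miniature version of the U-shape the figure captions describe.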