Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation

Kening Zheng, Wei-Chieh Huang, Jiahao Huo, Zhonghao Li, Henry Peng Zou, Yibo Yan, Xin Zou, Jungang Li, Junzhuo Li, Hanrong Zhang, Xuming Hu, Philip S. Yu

Abstract

Mixture-of-Experts (MoE) models exhibit striking performance disparities across languages, yet the internal mechanisms driving these gaps remain poorly understood. In this work, we conduct a systematic analysis of expert routing patterns in MoE models, revealing a phenomenon we term Language Routing Isolation, in which high- and low-resource languages tend to activate largely disjoint expert sets. Through layer-stratified analysis, we further show that routing patterns exhibit a layer-wise convergence-divergence pattern across model depth. Building on these findings, we propose RISE (Routing Isolation-guided Subnetwork Enhancement), a framework that exploits routing isolation to identify and adapt language-specific expert subnetworks. RISE applies a tripartite selection strategy, using specificity scores to identify language-specific experts in shallow and deep layers and overlap scores to select universal experts in middle layers. By training only the selected subnetwork while freezing all other parameters, RISE substantially improves low-resource language performance while preserving capabilities in other languages. Experiments on 10 languages demonstrate that RISE achieves target-language F1 gains of up to 10.85% with minimal cross-lingual degradation.
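The tripartite selection the abstract describes can be sketched as follows. This is a minimal illustration under assumed definitions, not the paper's exact method: the specificity and overlap scores, the layer boundaries, and the `activation` statistics are all hypothetical placeholders.

```python
import numpy as np

def select_experts(activation, lang_idx, n_shallow, n_deep, top_k):
    """Pick a per-layer expert subset from routing statistics.

    activation: array of shape (n_layers, n_langs, n_experts) giving how
    often each expert is routed to for each language (hypothetical stats).
    lang_idx: index of the target (low-resource) language.
    Shallow/deep layers: keep experts with high *specificity* for the
    target language; middle layers: keep high *overlap* (universal) experts.
    """
    n_layers, n_langs, n_experts = activation.shape
    selected = {}
    for l in range(n_layers):
        freq = activation[l]                       # (n_langs, n_experts)
        target = freq[lang_idx]
        average = freq.mean(axis=0)
        if l < n_shallow or l >= n_layers - n_deep:
            # Specificity: target-language usage relative to average usage.
            score = target / (average + 1e-9)
        else:
            # Overlap: number of languages that use the expert above average.
            score = (freq > average).sum(axis=0).astype(float)
        selected[l] = np.argsort(score)[-top_k:].tolist()
    return selected
```

During adaptation, only the parameters of the experts returned here would be unfrozen; everything else stays fixed, which is what limits cross-lingual degradation.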

Paper Structure

This paper contains 43 sections, 3 theorems, 31 equations, 10 figures, 12 tables, and 1 algorithm.

Key Result

Lemma 1

For any expert $i$ at layer $l$, the gradient of the training loss $\mathcal{L}_{\lambda^*}$ on target-language data $\mathcal{D}_{\lambda^*}$ satisfies
$$\nabla_{\theta_i^{(l)}}\mathcal{L}_{\lambda^*} = \mathbb{E}_{x\sim\mathcal{D}_{\lambda^*}}\Big[\sum_{t} g_{t,i}^{(l)}(x)\,\nabla_{\theta_i^{(l)}}\ell_t(x)\Big],$$
where $\ell_t(x)$ is the token-level loss contribution. Hence, if expert $i$ is never activated by target-language tokens, i.e., $g_{t,i}^{(l)}(x)=0$ for all $(x,t)$ from $\mathcal{D}_{\lambda^*}$, then $\nabla_{\theta_i^{(l)}}\mathcal{L}_{\lambda^*}=0$.
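As a minimal numerical illustration of this gradient isolation (not the paper's implementation), consider a single MoE layer with linear experts, $y=\sum_i g_i W_i x$, and loss $L=\tfrac12\|y-y^*\|^2$. The gradient with respect to $W_i$ is $g_i\,(y-y^*)\,x^\top$, so an expert whose gating weight is zero receives exactly zero gradient; the expert matrices and router weights below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3
W = [rng.standard_normal((d, d)) for _ in range(n_experts)]
x = rng.standard_normal(d)
y_target = rng.standard_normal(d)

# Router assigns zero weight to expert 2 (never activated by this token).
g = np.array([0.7, 0.3, 0.0])

y = sum(g[i] * W[i] @ x for i in range(n_experts))
err = y - y_target                     # dL/dy for L = 0.5 * ||y - y*||^2

# Analytic gradient of L w.r.t. each expert's weights: g_i * err * x^T
grads = [g[i] * np.outer(err, x) for i in range(n_experts)]

print(np.abs(grads[2]).max())          # 0.0: the unrouted expert gets no gradient
```

This is exactly why RISE can train only the selected subnetwork: experts outside it receive no target-language gradient signal in the first place, and freezing the rest guarantees invariance for the unselected parameters.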

Figures (10)

  • Figure 1: Comprehensive routing analysis of Qwen3-30B-A3B: global-level (left) expert activation overlap and layer-wise (right) routing similarity with English.
  • Figure 2: Overview of RISE. (a) We first collect routing statistics across multiple languages. (b) Based on layer-aware analysis, we select language-specific experts in shallow/deep layers and cross-lingual shared experts in middle layers. (c) Only the selected experts are trained while keeping all other parameters frozen.
  • Figure 3: Grouped comparison of layer-wise expert subset combinations. w/o: removing the corresponding layer group; Only: retaining only the corresponding layer group.
  • Figure 4: Ablation studies for hyperparameters. (a) Effect of the activation scale factor $\alpha$ in the composite expert selection. (b) Effect of the expert budget allocation ratio across shallow, middle, and deep layers. (c) Effect of the total number of selected experts $K$.
  • Figure 5: Global-level expert activation overlap of Qwen3-30B-A3B and Phi-3.5-MoE across languages in TyDiQA and MGSM.
  • ...and 5 more figures

Theorems & Definitions (6)

  • Lemma 1: Exact gradient isolation
  • proof
  • Theorem 1: Exact invariance under disjoint routing
  • proof
  • Theorem 2: Cross-lingual perturbation bound
  • proof