Part-Of-Speech Sensitivity of Routers in Mixture of Experts Models

Elie Antoine, Frédéric Béchet, Philippe Langlais

TL;DR

This work investigates whether routers in Mixture-of-Experts (MoE) language models exhibit Part-of-Speech (POS) sensitivity by analyzing the per-token routing paths across layers in six MoE architectures. It treats the model-integrated routers as probes and trains an MLP to predict POS from the token routing sequence, while quantifying layer-wise specialization via $\mathrm{Spec}_{\mathrm{POS}}$ and KL-divergence metrics. Key findings show robust POS-specific specialization across models, with distinct clustering of routing paths for major POS categories and strong predictive signals from early-layer routing. The results advance understanding of how linguistic structure is reflected in MoE routing, offering diagnostic insights for interpretability and routing design in large-scale MoE systems.

Abstract

This study investigates the behavior of model-integrated routers in Mixture of Experts (MoE) models, focusing on how tokens are routed based on their linguistic features, specifically Part-of-Speech (POS) tags. The goal is to explore across different MoE architectures whether experts specialize in processing tokens with similar linguistic traits. By analyzing token trajectories across experts and layers, we aim to uncover how MoE models handle linguistic information. Findings from six popular MoE models reveal expert specialization for specific POS categories, with routing paths showing high predictive accuracy for POS, highlighting the value of routing paths in characterizing tokens.
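To make the notion of a routing path concrete, here is a minimal sketch of how a per-token path could be turned into a fixed-length feature vector for a POS classifier. It assumes a path is a list with one tuple per layer, each tuple holding the 1-based indices of the selected experts (the convention used in the Figure 1 caption). The function name `encode_path`, the multi-hot encoding, and the two-layer toy path are illustrative assumptions, not the authors' actual input representation.

```python
def encode_path(path, num_experts):
    """Flatten a routing path into a multi-hot vector.

    path: list of tuples, one per layer, holding 1-based expert indices
          (hypothetical format, modelled on the Figure 1 caption).
    num_experts: number of experts available in every layer.
    Returns a list of 0/1 values of length len(path) * num_experts.
    """
    features = []
    for layer_index, selected in enumerate(path):
        layer_vec = [0] * num_experts
        for expert in selected:
            if not 1 <= expert <= num_experts:
                raise ValueError(
                    f"expert {expert} out of range in layer {layer_index}"
                )
            layer_vec[expert - 1] = 1  # mark this expert as selected
        features.extend(layer_vec)
    return features


# Toy two-layer path with 2 of 8 experts selected per layer (illustrative only).
toy_path = [(1, 4), (8, 3)]
vector = encode_path(toy_path, num_experts=8)
print(vector)
```

A multi-hot encoding discards the order of the selected experts within a layer. If the router's ranking matters, one possible alternative is to store the router weights in place of the 0/1 flags.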

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example of token routing with 2 of 8 selected experts. For "human_", the path is $[(1,4),(8,3),\ldots,(1,2)]$; for "_ities", it is $[(6,4),(2,5),\ldots,(2,8)]$.
  • Figure 2: 2D t-SNE projection of token paths.
  • Figure 3: Accuracy of MLP trained on ablated signal per model, removing information from first or last layers.
  • Figure 4: Confusion matrix of the MLP's POS predictions for all models.
  • Figure 5: KL Divergence Matrices for all Models. Heatmaps showing the KL divergence between expert distributions and uniform distribution at each layer. The x-axis represents the layers, the y-axis represents the experts, and the color scale indicates the magnitude of KL divergence.
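The per-layer, per-expert quantity described in the Figure 5 caption can be sketched as follows. This is a hedged illustration, not the paper's code. It assumes that an "expert distribution" is the distribution of POS tags among the tokens routed to that expert, that the divergence is computed as $\mathrm{KL}(p \,\|\, u)$ with $u$ uniform over the tag set, and that natural logarithms are used. The names `kl_to_uniform` and `kl_matrix`, and the record format `(layer, expert, pos)`, are hypothetical.

```python
import math
from collections import Counter


def kl_to_uniform(counts):
    """Return KL(p || u) in nats, with p the normalized counts and u uniform."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    k = len(counts)
    kl = 0.0
    for c in counts:
        if c > 0:  # terms with p = 0 contribute nothing
            p = c / total
            kl += p * math.log(p * k)  # p * log(p / (1 / k))
    return kl


def kl_matrix(records, pos_tags):
    """Map each (layer, expert) pair to the KL divergence of its POS distribution.

    records: iterable of (layer, expert, pos) routing observations (assumed format).
    pos_tags: the full tag inventory, fixing the support of the uniform reference.
    """
    counts = {}
    for layer, expert, pos in records:
        counts.setdefault((layer, expert), Counter())[pos] += 1
    return {
        key: kl_to_uniform([tag_counts[tag] for tag in pos_tags])
        for key, tag_counts in counts.items()
    }


# Toy data: expert 1 in layer 0 only sees nouns, expert 2 sees a balanced mix.
tags = ["NOUN", "VERB"]
toy_records = [
    (0, 1, "NOUN"), (0, 1, "NOUN"),
    (0, 2, "NOUN"), (0, 2, "VERB"),
]
print(kl_matrix(toy_records, tags))
```

Under these assumptions, a value of 0 means the expert receives all tags equally often, and the maximum, $\log K$ for $K$ tags, means it receives a single tag only.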