Part-Of-Speech Sensitivity of Routers in Mixture of Experts Models
Elie Antoine, Frédéric Béchet, Philippe Langlais
TL;DR
This work investigates whether routers in Mixture-of-Experts (MoE) language models are sensitive to Part-of-Speech (POS) by analyzing per-token routing paths across layers in six MoE architectures. It treats the model-integrated routers as probes and trains an MLP to predict POS from each token's routing sequence, while quantifying layer-wise specialization via Spec_{POS} and KL-divergence metrics. The findings show robust POS-specific specialization across models, with distinct clustering of routing paths for major POS categories and strong predictive signals from early-layer routing. The results clarify how linguistic structure is reflected in MoE routing and offer diagnostic insights for interpretability and routing design in large-scale MoE systems.
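As a rough illustration of the KL-divergence idea mentioned above, the sketch below compares, for a single layer, the expert-usage distribution of each POS tag with the distribution over all tokens. The exact definition used in the paper is not given in this summary, so the direction KL(per-POS || overall), the additive smoothing, the top-1 routing assumption, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def expert_distribution(expert_ids, num_experts, eps=1e-9):
    """Smoothed empirical distribution over experts (sums to 1)."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [(counts.get(e, 0) + eps) / (total + eps * num_experts)
            for e in range(num_experts)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy data: (POS tag, top-1 expert chosen at one layer) per token.
routed = [("NOUN", 0), ("NOUN", 0), ("NOUN", 1),
          ("VERB", 2), ("VERB", 2), ("VERB", 3),
          ("DET", 1), ("DET", 1)]
NUM_EXPERTS = 4

# Expert usage over all tokens, regardless of POS.
overall = expert_distribution([e for _, e in routed], NUM_EXPERTS)

# Divergence of each POS tag's expert usage from the overall usage.
kl_by_pos = {}
for pos in sorted({p for p, _ in routed}):
    per_pos = expert_distribution([e for p, e in routed if p == pos], NUM_EXPERTS)
    kl_by_pos[pos] = kl_divergence(per_pos, overall)

for pos, value in kl_by_pos.items():
    print(f"{pos}: {value:.3f}")
```

Under these assumptions, a larger value means the tokens of that POS tag are concentrated on fewer experts than the layer's average token; a value near zero means routing at that layer is largely independent of the tag.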
Abstract
This study investigates the behavior of model-integrated routers in Mixture of Experts (MoE) models, focusing on how tokens are routed based on their linguistic features, specifically Part-of-Speech (POS) tags. We examine, across different MoE architectures, whether experts specialize in processing tokens with similar linguistic traits. By analyzing token trajectories across experts and layers, we aim to uncover how MoE models handle linguistic information. Findings from six popular MoE models reveal expert specialization for specific POS categories. Routing paths also predict POS tags with high accuracy, which highlights their value for characterizing tokens.
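To make the notion of a routing path concrete, here is a minimal sketch of how a token's trajectory could be turned into a fixed-length feature vector that a classifier such as an MLP could consume. It assumes a single (top-1) expert per layer and a one-hot encoding per layer; MoE models commonly route each token to several experts, and the encoding actually used in the paper is not stated in this abstract, so the function name `encode_path` and the toy path are purely illustrative.

```python
def encode_path(path, num_experts):
    """Concatenate one one-hot vector per layer for a top-1 routing path.

    `path` holds one expert id per layer, e.g. [2, 0, 3] for three layers.
    """
    features = []
    for layer, expert_id in enumerate(path):
        if not 0 <= expert_id < num_experts:
            raise ValueError(f"layer {layer}: expert id {expert_id} out of range")
        one_hot = [0.0] * num_experts
        one_hot[expert_id] = 1.0
        features.extend(one_hot)
    return features


# Hypothetical path: the expert selected at layers 0, 1 and 2.
path = [2, 0, 3]
vec = encode_path(path, num_experts=4)
print(vec)
```

The resulting vector has length `num_layers * num_experts`, with exactly one non-zero entry per layer, so tokens with identical trajectories map to identical feature vectors.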
