Routing in Sparsely-gated Language Models responds to Context
Stefan Arnold, Marian Fietta, Dilara Yesilbas
TL;DR
This work investigates whether routing decisions in sparsely-gated Mixture-of-Experts (MoE) transformers respond to contextual information. Using the Switch Transformer in an encoder-decoder setup and varying the total number of experts, the authors measure context sensitivity on similarity- and context-aware datasets, using Jensen-Shannon similarity and Spearman correlation. They find that encoder routing is meaningfully shaped by context, with stronger effects as the number of experts grows, while decoder routing remains more variable and less context-dependent. The results indicate that contextual cues can refine token-expert assignments beyond token identity, which informs MoE design and motivates broader analysis of how linguistic properties affect routing behavior.
Abstract
Language models (LMs) increasingly incorporate mixture-of-experts layers, each consisting of a router and a collection of experts, to scale up their parameter count under a fixed computational budget. Building on previous efforts indicating that token-expert assignments are predominantly influenced by token identities and positions, we trace routing decisions for similarity-annotated text pairs to evaluate the context sensitivity of learned token-expert assignments. We observe that routing in encoder layers mainly depends on (semantic) associations, but contextual cues provide an additional layer of refinement. Conversely, routing in decoder layers is more variable and markedly less sensitive to context.
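To make the measurement concrete, the following is a minimal sketch of how routing traces for similarity-annotated text pairs might be compared. It is not the authors' implementation. It assumes that each text is summarized as a histogram over expert assignments, that Jensen-Shannon similarity is taken as one minus the base-2 Jensen-Shannon divergence, and that the toy traces and gold scores are purely illustrative. All function names (`expert_histogram`, `js_similarity`, `average_ranks`, `spearman`) are hypothetical.

```python
import math
from collections import Counter


def expert_histogram(assignments, num_experts):
    """Turn per-token expert indices (non-empty list) into a probability distribution."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def js_similarity(p, q):
    """Assumed definition: 1 - Jensen-Shannon divergence (log base 2), in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (inputs must not be constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data (illustrative only): expert indices per token for each text in a
# pair, plus a hypothetical human similarity score for the pair.
NUM_EXPERTS = 4
pairs = [
    ([0, 1, 1, 2], [0, 1, 1, 2], 5.0),  # identical routing
    ([0, 1, 1, 2], [0, 1, 2, 2], 3.0),  # partially overlapping routing
    ([0, 0, 0, 0], [3, 3, 3, 3], 0.0),  # disjoint routing
]

sims = [
    js_similarity(expert_histogram(a, NUM_EXPERTS), expert_histogram(b, NUM_EXPERTS))
    for a, b, _ in pairs
]
gold = [score for _, _, score in pairs]
rho = spearman(sims, gold)
print(sims, rho)
```

Under these assumptions, a high rank correlation between routing similarity and annotated similarity would suggest that routing tracks the relation the annotation captures. In practice the comparison would be run per layer, and separately for encoder and decoder.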
