Routing in Sparsely-gated Language Models responds to Context

Stefan Arnold; Marian Fietta; Dilara Yesilbas

TL;DR

This work investigates whether routing decisions in sparsely-gated Mixture-of-Experts transformers respond to contextual information. Using the Switch Transformer in an encoder-decoder setup and varying the total number of experts, the authors measure context sensitivity through similarity- and context-aware datasets, employing Jensen-Shannon Similarity and Spearman correlations. They find that encoder routing is meaningfully shaped by context, with stronger effects as the number of experts grows, while decoder routing remains more variable and less context-dependent. The results demonstrate that contextual cues can refine token-expert assignments beyond token identity, informing MoE design and encouraging broader analysis of linguistic properties on routing behavior.

Abstract

Language models (LMs) have recently begun to incorporate mixture-of-experts layers, each consisting of a router and a collection of experts, to scale up their parameter count under a fixed computational budget. Building on previous efforts indicating that token-expert assignments are predominantly influenced by token identities and positions, we trace the routing decisions for similarity-annotated text pairs to evaluate the context sensitivity of learned token-expert assignments. We observe that routing in encoder layers mainly depends on (semantic) associations, but contextual cues provide an additional layer of refinement. Conversely, routing in decoder layers is more variable and markedly less sensitive to context.
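To make the measurement idea concrete, the following is a minimal sketch of how routing traces for text pairs could be compared and then correlated with similarity annotations. It rests on several assumptions that are not confirmed by the text above: routing is top-1 (one expert index per token), a text is summarized by its expert-usage histogram, and the Jensen-Shannon similarity is taken as one minus the base-2 Jensen-Shannon divergence. All function names and the toy data are hypothetical.

```python
import math

def expert_distribution(assignments, num_experts):
    """Turn a list of top-1 expert indices into a normalized usage histogram."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    return [c / len(assignments) for c in counts]

def js_similarity(p, q):
    """Assumed definition: 1 - Jensen-Shannon divergence with base-2 logs (range [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))

def ranks(values):
    """1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy data (hypothetical): expert indices per token for three text pairs,
# plus made-up similarity annotations for those pairs.
pairs = [
    ([0, 1, 1, 2], [0, 1, 1, 2]),
    ([0, 1, 1, 2], [0, 1, 3, 3]),
    ([0, 0, 0, 0], [3, 3, 3, 3]),
]
human_scores = [5.0, 3.0, 0.5]

routing_scores = [
    js_similarity(expert_distribution(a, 4), expert_distribution(b, 4))
    for a, b in pairs
]
print(routing_scores, spearman(routing_scores, human_scores))
```

A histogram over experts discards token order, so this sketch captures only which experts a text uses, not where in the sequence they are used. A per-layer analysis would repeat the same computation on the traces of each mixture-of-experts layer.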

Paper Structure (14 sections, 3 figures, 2 tables)

Introduction
Contribution.
Background
Token Choice.
Expert Choice.
Methodology
Measurements for Similarity.
Measurements for Context.
Findings
Correlation with Similarity
Correlation with Context
Correlation with Ambiguity
Conclusion
Limitation.

Figures (3)

Figure 1: Density estimates for routing similarities of ambiguous words given different and identical contexts. Routing decisions are aggregated across expert configurations.
Figure 2: Layer-wise effect sizes using Cohen's $d$ on the routing similarities of ambiguous words given some context. Routing decisions are aggregated across expert configurations.
Figure 3: Differences in routing similarities for a set of ambiguous words given some context, as a function of the number of unique meanings derived from WordNet.
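The effect sizes in Figure 2 are reported as Cohen's $d$. As a reminder of what that quantity measures, here is a minimal sketch using the common pooled-standard-deviation form; whether the paper uses exactly this variant is an assumption, and the function name and sample values are hypothetical.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical routing similarities of one ambiguous word in one layer:
# occurrences in identical contexts versus occurrences in different contexts.
identical_context = [0.92, 0.88, 0.95, 0.90]
different_context = [0.81, 0.85, 0.78, 0.84]
print(cohens_d(identical_context, different_context))
```

A positive value here would mean that routing is more similar when the context is identical than when it differs; computing it once per layer gives a layer-wise profile of the kind the caption describes.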
