Probing Semantic Routing in Large Mixture-of-Expert Models

Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Man Luo, Sungduk Yu, Chendi Xue, Vasudev Lal

TL;DR

The paper investigates whether large MoE models route tokens by meaning rather than purely by token identity. Using two controlled probes, Word-in-Context (WiC) and lexical substitution (SWORDS), together with a Cohen-like normalized overlap metric, the authors find statistically significant semantic routing across six MoE models from three families, with stronger effects in middle layers and in larger models. A qualitative case study in DiscoveryWorld shows that specific reasoning patterns map to small sets of experts, suggesting emergent cognitive specialization. These findings advance interpretability and open avenues for targeted control and efficiency in sparse MoE deployments.

Abstract

In the past year, large (>100B parameter) mixture-of-expert (MoE) models have become increasingly common in the open domain. While their advantages are often framed in terms of efficiency, prior work has also explored functional differentiation through routing behavior. We investigate whether expert routing in large MoE models is influenced by the semantics of the inputs. To test this, we design two controlled experiments. First, we compare activations on sentence pairs with a shared target word used in the same or different senses. Second, we fix context and substitute the target word with semantically similar or dissimilar alternatives. Comparing expert overlap across these conditions reveals clear, statistically significant evidence of semantic routing in large MoE models.
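The comparison of expert overlap described above can be illustrated with a minimal sketch. It assumes that each token's routing is available as a list of selected expert indices per layer, and it corrects raw overlap for chance in the style of Cohen's kappa. The function names, the choice of overlap measure, and the uniform-random chance model are all assumptions made for illustration; the summary does not state the paper's exact metric.

```python
from typing import Sequence


def expert_overlap(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of experts shared by two routed expert sets (hypothetical measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        raise ValueError("each routing decision must select at least one expert")
    return len(set_a & set_b) / max(len(set_a), len(set_b))


def normalized_overlap(a: Sequence[int], b: Sequence[int], n_experts: int) -> float:
    """Chance-corrected overlap: (observed - chance) / (1 - chance).

    Assumption: under chance, both sets are k experts drawn uniformly at
    random from n_experts, so the expected shared fraction is k / n_experts.
    """
    k = max(len(set(a)), len(set(b)))
    chance = k / n_experts
    if chance >= 1.0:
        raise ValueError("every expert is selected, so overlap carries no signal")
    return (expert_overlap(a, b) - chance) / (1.0 - chance)


def mean_normalized_overlap(
    route_a: Sequence[Sequence[int]],
    route_b: Sequence[Sequence[int]],
    n_experts: int,
) -> float:
    """Average the chance-corrected overlap over layers for one token pair."""
    if len(route_a) != len(route_b):
        raise ValueError("both routes must cover the same number of layers")
    scores = [normalized_overlap(a, b, n_experts) for a, b in zip(route_a, route_b)]
    return sum(scores) / len(scores)


# Toy data (made up): 2 layers, top-2 routing over 8 experts.
target = [[0, 1], [2, 3]]
same_sense = [[0, 1], [2, 5]]
different_sense = [[4, 5], [2, 6]]

same_score = mean_normalized_overlap(target, same_sense, n_experts=8)
diff_score = mean_normalized_overlap(target, different_sense, n_experts=8)
print(f"same-sense: {same_score:.3f}, different-sense: {diff_score:.3f}")
```

Under this sketch, a score of 1 means identical expert sets, 0 means overlap at chance level, and negative values mean less overlap than chance. Semantic routing would show up as same-sense pairs scoring consistently higher than different-sense pairs.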

Paper Structure

This paper contains 19 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Summary of Experimental Design. We compare expert routing patterns in two controlled experiments. Top: we hold the target word constant, and change the context to either change the meaning of the target word or keep it the same. Bottom: we hold context constant, and substitute the target word for a similar-meaning or different-meaning word.
  • Figure 2: Difference in expert overlap between same-sense and different-sense words, across models and datasets. For all models, expert overlap is statistically significantly higher for same-sense than for different-sense words, relative to a random baseline.
  • Figure 3: Layer-wise analysis of MoE LLMs. The change in overlap is generally larger in the middle layers (e.g., DeepSeek-R1) and smaller in earlier and later layers. Llama models, with only 1 expert, show much noisier behavior, with a notable spike in overlap at the penultimate layer.
  • Figure 4: Left: identified reasoning tokens of SAE head 15376 (highlights indicate non-zero head activation) on DiscoveryWorld chain-of-thought generations. This head activates when the model analyzes its hypotheses. Right: tokens from SAE head 12649. This head activates when R1 catches an internal reasoning error.
  • Figure 5: Visual observation in the Reactor Lab environment at step 50.