Route Experts by Sequence, not by Token

Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You

TL;DR

The paper tackles inefficiency in MoE routing caused by per-token TopK budgets that over-allocate to easy tokens. It proposes SeqTopK, a minimal, parameter-free shift of the expert budget to the sequence level, selecting the top $K_{\text{seq}}=T K$ scores across a sequence to enable context-aware allocation without increasing compute. The authors introduce Online SeqTopK with an Expert Cache for autoregressive decoding, and provide extensive experiments showing consistent gains over TopK and prior adaptive methods, especially under higher sparsity, with negligible overhead. The work demonstrates improved efficiency and scalability of ultra-sparse MoEs in diverse domains and supports easy integration with pretrained checkpoints and existing MoE frameworks.

Abstract

Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.
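
The difference between the two budgets can be illustrated with a minimal sketch. It is written in plain Python rather than PyTorch so it runs without dependencies, and it is not the authors' implementation: the function names `topk_routing` and `seqtopk_routing` and the score matrix are hypothetical, chosen only to show that selecting the top $T \cdot K$ scores across a sequence keeps the total budget fixed while letting per-token counts vary.

```python
import heapq


def topk_routing(scores, k):
    """Standard TopK: every token keeps its own k highest-scoring experts."""
    routes = []
    for token_scores in scores:
        best = heapq.nlargest(
            k, range(len(token_scores)), key=token_scores.__getitem__
        )
        routes.append(sorted(best))
    return routes


def seqtopk_routing(scores, k):
    """Sequence-level TopK: keep the T * k highest scores over all tokens."""
    num_tokens = len(scores)
    budget = num_tokens * k  # same total budget as per-token TopK
    # Flatten to (score, token index, expert index) triples.
    flat = [
        (score, t, e)
        for t, row in enumerate(scores)
        for e, score in enumerate(row)
    ]
    selected = heapq.nlargest(budget, flat, key=lambda item: item[0])
    routes = [[] for _ in range(num_tokens)]
    for _, t, e in selected:
        routes[t].append(e)
    return [sorted(r) for r in routes]


# Hypothetical router scores for T = 3 tokens and 4 experts.
scores = [
    [0.70, 0.20, 0.05, 0.05],  # one dominant expert
    [0.30, 0.28, 0.27, 0.15],  # scores spread over several experts
    [0.60, 0.25, 0.10, 0.05],
]

topk = topk_routing(scores, k=2)
seqtopk = seqtopk_routing(scores, k=2)
print(topk)     # [[0, 1], [0, 1], [0, 1]]
print(seqtopk)  # [[0], [0, 1, 2], [0, 1]]
```

Both routings activate six token-expert pairs in total, but the sequence-level variant gives the second token three experts and the first only one. This sketch does not enforce a per-token minimum, so a token could in principle receive no experts; how the released code handles that case is not described here.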

Paper Structure

This paper contains 22 sections, 7 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of our proposed method. (a) Illustrative comparison between standard TopK Routing and SeqTopK Routing in MoE models. Under the same expert budget for a sequence ($T \cdot K = 3 \cdot 2$ in this case), SeqTopK routes experts by comparing expert scores across all tokens in a sequence, enabling dynamic and context-aware allocation of experts (e.g., more experts for hard tokens) via end-to-end training. (b) Performance of fine-tuned OLMoE-A1B-7B on GSM8K and MBPP datasets. SeqTopK consistently outperforms TopK under different expert budgets ($K = 2, 4, 8$), and the gain is much larger under sparser MoEs.
  • Figure 2: Change of token probability $P({\bm{x}}_t|{\bm{x}}_{<t})$ under varying active experts ($K$). (a) Token probability change vs. token proportion. Given the same prefix, we sample over 10k tokens across different $K$ values, computing probability differences relative to $K=8$ (the original OLMoE setting). About 60% of tokens show little probability change when $K$ is reduced from 8 to 4 or 6, while 10% change dramatically, indicating that different tokens can require quite different numbers of activated experts to predict. (b) & (c) Word clouds of the top 50 tokens with small ($<0.01$) and large ($>0.5$) token probability shifts. Tokens with larger probability differences are content words that influence semantic direction or topic shifts, whereas tokens with smaller probability differences are numbers or function words that maintain structure.
  • Figure 3: (a) & (b) Simple PyTorch implementations of TopK and SeqTopK routing. With a minimal modification of TopK routing (highlighted), SeqTopK enables dynamic and context-aware expert allocation by comparing expert scores across all tokens in a sequence.
  • Figure 4: Correlation between token entropy and expert activation. (a) Analysis of 10K tokens generated by fine-tuned Qwen-1.5-A2.7B, showing that higher token entropy -- defined as the entropy of the output distribution at each token, capturing prediction uncertainty and token hardness -- correlates well with a larger average number of activated experts. (b) Illustration from a specific generated sequence, where SeqTopK often activates more experts on high-entropy tokens, such as the "$" symbol marking the start of a math expression.
  • Figure 5: Routing dynamics of SeqTopK. (a) Layer-wise normalized entropy comparison. Higher entropy means more balanced expert utilization. SeqTopK consistently exhibits higher entropy than TopK, suggesting that its sequence-level routing encourages more uniform (i.e., balanced) expert utilization. (b) Expert-load histogram at layer 16. SeqTopK presents smoother and more balanced expert utilization compared to TopK.
  • ...and 1 more figure
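
The normalized entropy mentioned in the Figure 5 caption can be sketched as follows. The exact definition used in the paper is not given in this excerpt; the version below assumes the common convention of dividing the Shannon entropy of the expert-load distribution by $\log N$ for $N$ experts, so that 1 means perfectly balanced utilization and 0 means all tokens go to a single expert. The function name `normalized_load_entropy` and the example counts are hypothetical.

```python
import math


def normalized_load_entropy(expert_counts):
    """Shannon entropy of the expert-load distribution, divided by log(N).

    Assumed convention: returns a value in [0, 1], where 1 is a perfectly
    uniform load over the N experts and 0 is all load on one expert.
    """
    total = sum(expert_counts)
    num_experts = len(expert_counts)
    if total == 0 or num_experts < 2:
        return 0.0
    # Skip empty experts, since p * log(p) tends to 0 as p tends to 0.
    probs = [count / total for count in expert_counts if count > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(num_experts)


# Hypothetical per-expert token counts for one layer with 4 experts.
balanced = normalized_load_entropy([25, 25, 25, 25])
skewed = normalized_load_entropy([97, 1, 1, 1])
collapsed = normalized_load_entropy([100, 0, 0, 0])
```

Under this convention `balanced` is 1 (up to floating-point error), `collapsed` is 0, and `skewed` falls in between, which matches the caption's reading that higher values indicate more balanced expert utilization.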