Route Experts by Sequence, not by Token

Tiansheng Wen; Yifei Wang; Aosong Feng; Long Ma; Xinyang Liu; Yifan Wang; Lixuan Guo; Bo Chen; Stefanie Jegelka; Chenyu You

TL;DR

The paper tackles inefficiency in MoE routing caused by per-token TopK budgets that over-allocate to easy tokens. It proposes SeqTopK, a minimal, parameter-free shift of the expert budget to the sequence level, selecting the top $K_{\text{seq}}=T K$ scores across a sequence to enable context-aware allocation without increasing compute. The authors introduce Online SeqTopK with an Expert Cache for autoregressive decoding, and provide extensive experiments showing consistent gains over TopK and prior adaptive methods, especially under higher sparsity, with negligible overhead. The work demonstrates improved efficiency and scalability of ultra-sparse MoEs in diverse domains and supports easy integration with pretrained checkpoints and existing MoE frameworks.

Abstract

Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.
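The budget shift described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It is a plain-Python illustration under stated assumptions: `scores` is taken to be a T x E matrix of non-negative router scores, the function names `topk_mask` and `seq_topk_mask` are hypothetical, and gate renormalization, load balancing, and the online decoding variant are left out. In this sketch nothing prevents a token from receiving zero experts; how that case is handled is not stated in this section.

```python
import heapq


def topk_mask(scores, k):
    """Token-level TopK: each token keeps its own k highest-scoring experts."""
    mask = []
    for row in scores:
        keep = set(heapq.nlargest(k, range(len(row)), key=row.__getitem__))
        mask.append([e in keep for e in range(len(row))])
    return mask


def seq_topk_mask(scores, k):
    """Sequence-level TopK (illustrative): keep the T*k highest scores
    over the whole T x E score matrix, so the total budget is unchanged
    but the per-token count is free to vary."""
    num_tokens = len(scores)
    num_experts = len(scores[0])
    budget = num_tokens * k  # same overall budget as token-level TopK
    flat = [(scores[t][e], t, e)
            for t in range(num_tokens)
            for e in range(num_experts)]
    mask = [[False] * num_experts for _ in range(num_tokens)]
    for _, t, e in heapq.nlargest(budget, flat, key=lambda item: item[0]):
        mask[t][e] = True
    return mask


def experts_per_token(mask):
    """Count how many experts each token activates."""
    return [sum(row) for row in mask]


# Made-up router scores for 3 tokens and 4 experts (assumed values).
scores = [
    [0.70, 0.20, 0.05, 0.05],  # confident routing: one dominant expert
    [0.30, 0.28, 0.22, 0.20],  # flat routing: many plausible experts
    [0.55, 0.35, 0.06, 0.04],
]

token_level = experts_per_token(topk_mask(scores, k=2))
sequence_level = experts_per_token(seq_topk_mask(scores, k=2))
print(token_level)     # [2, 2, 2]
print(sequence_level)  # [1, 3, 2]
```

Both masks activate six token-expert pairs in total. The sequence-level version moves one slot from the first token, whose scores are dominated by a single expert, to the second token, whose scores are nearly flat.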

Paper Structure

Table of Contents

Figures (6)