Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun
TL;DR
This work targets decode latency in large Mixture-of-Experts (MoE) transformers, showing that in memory-bound decoding, latency scales with the number of activated experts. It introduces Opportunistic Expert Activation (OEA), a two-phase, batch-aware routing scheme that first guarantees a per-token quality baseline and then opportunistically piggybacks on experts already loaded for other tokens in the same batch, shrinking the global active set without retraining. A simple memory-bound model shows that latency is driven by the active-expert count $T$, and empirical results on Qwen3-30B and Qwen3-235B at batch size $16$ report MoE-layer decode latency reductions of up to $39\%$ and $15\%$, respectively, with negligible accuracy loss. The approach is complementary to existing routing and architectural strategies and offers a practical path to faster MoE decoding in real-time settings.
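To make the role of the active-expert count $T$ concrete, the sketch below estimates the expected value of $T$ for a single MoE layer. It rests on a simplifying assumption that is not part of the text above: each token selects `top_k` distinct experts uniformly at random, independently of the other tokens. The function names and the example configuration (128 experts, top-8 routing) are illustrative assumptions, and real routers are far from uniform, so this is only a rough guide.

```python
def expected_active_experts(num_experts: int, top_k: int, batch_size: int) -> float:
    """Expected number of distinct experts touched by one batch.

    Assumes uniform, independent routing (an illustrative assumption).
    """
    # One token misses a given expert with probability 1 - k/N;
    # all B independent tokens miss it with probability (1 - k/N)^B.
    p_missed_by_all = (1.0 - top_k / num_experts) ** batch_size
    return num_experts * (1.0 - p_missed_by_all)


def rough_tokens_per_active_expert(num_experts: int, top_k: int, batch_size: int) -> float:
    """Total token-expert assignments divided by the expected active count.

    This is a ratio of totals, not an exact expectation of the per-expert load.
    """
    return batch_size * top_k / expected_active_experts(num_experts, top_k, batch_size)


if __name__ == "__main__":
    # Hypothetical configuration: 128 experts, top-8 routing, batch of 16 tokens.
    print(expected_active_experts(128, 8, 16))
    print(rough_tokens_per_active_expert(128, 8, 16))
```

Under these assumptions, a batch of 16 tokens touches roughly 82 of the 128 experts, so each loaded expert serves fewer than two tokens on average. Weight loading, not compute, then dominates the cost, which is why reducing $T$ is the lever for lowering latency.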
Abstract
An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, in which the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even at moderate batch sizes, because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results use batch-aware routing in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens in the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach reduces MoE layer decode latency by $39\%$ and $15\%$, respectively.
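The piggybacking idea can be sketched as a two-phase selection over a batch of router scores. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `oea_route`, the baseline size `k0`, and the rule "fill up to `k` experts in descending score order from the already-active set" are hypothetical choices made for this sketch. How the selected experts' outputs are weighted afterwards is not covered.

```python
import numpy as np


def oea_route(scores: np.ndarray, k: int, k0: int) -> np.ndarray:
    """Return a boolean (batch, experts) mask of selected experts.

    scores: (batch, experts) router scores, higher is better.
    k:      maximum experts per token (the usual top-k budget).
    k0:     experts each token always keeps (assumed quality baseline).
    """
    batch, num_experts = scores.shape
    # Experts for each token, sorted from highest to lowest score.
    order = np.argsort(-scores, axis=1)

    # Phase 1: every token keeps its own top-k0 experts.
    mask = np.zeros((batch, num_experts), dtype=bool)
    rows = np.arange(batch)[:, None]
    mask[rows, order[:, :k0]] = True

    # Experts that must be loaded anyway because some token needs them.
    active = mask.any(axis=0)

    # Phase 2: each token adds its best remaining experts, but only
    # from the already-active set, until it holds at most k experts.
    for b in range(batch):
        remaining = k - k0
        for e in order[b, k0:]:
            if remaining == 0:
                break
            if active[e]:
                mask[b, e] = True
                remaining -= 1
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_scores = rng.standard_normal((16, 128))  # synthetic scores, not real router output
    demo_mask = oea_route(demo_scores, k=8, k0=2)
    print("active experts:", int(demo_mask.any(axis=0).sum()))
```

In this sketch the active set is fixed entirely by phase 1, so phase 2 never loads an additional expert. A token may end up with fewer than `k` experts if too few active experts remain, which is the quality-versus-latency trade-off that a choice of `k0` would control.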
