Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun

TL;DR

The work addresses decode latency in large Mixture-of-Experts transformers by demonstrating that latency scales with the number of activated experts in memory-bound decoding. It introduces Opportunistic Expert Activation (OEA), a two-phase, batch-aware routing that guarantees a per-token quality baseline and then opportunistically piggybacks on already-loaded experts within the same batch to reduce the global active set without retraining. A simple memory-bound model shows latency is driven by the active-expert count $T$, and empirical results on Qwen3-30B and Qwen3-235B at batch size $16$ report latency reductions of up to $39\%$ and $15\%$, respectively, with negligible accuracy loss. The approach is complementary to existing routing and architectural strategies and offers a practical pathway to faster MoE decoding in real-time settings.
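
The two-phase idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `oea_route` and `vanilla_active_set`, the parameter `k0` (the per-token baseline size) and `k` (the per-token cap) are hypothetical, and re-normalization of expert weights is omitted.

```python
def vanilla_active_set(scores, k):
    """Union of each token's top-k experts under ordinary top-k routing."""
    active = set()
    for s in scores:
        ranked = sorted(range(len(s)), key=lambda e: -s[e])
        active.update(ranked[:k])
    return active


def oea_route(scores, k0, k):
    """Hypothetical two-phase, batch-aware routing sketch.

    scores: one list of router scores per token (one score per expert).
    k0:     experts every token is guaranteed (assumed quality baseline).
    k:      maximum experts per token after piggybacking.
    """
    ranked = [sorted(range(len(s)), key=lambda e: -s[e]) for s in scores]

    # Phase 1: each token keeps its k0 best experts; their union is the
    # set of experts that must be loaded for this batch.
    baseline = [r[:k0] for r in ranked]
    active = {e for b in baseline for e in b}

    # Phase 2: each token adds up to k - k0 further experts, chosen in its
    # own preference order, but only from experts that are already active.
    routes = []
    for r, b in zip(ranked, baseline):
        extra = [e for e in r[k0:] if e in active][: k - k0]
        routes.append(b + extra)
    return routes, active


# Toy batch: 3 tokens, 4 experts.
scores = [
    [0.50, 0.30, 0.10, 0.10],
    [0.10, 0.20, 0.60, 0.10],
    [0.45, 0.40, 0.05, 0.10],
]
routes, active = oea_route(scores, k0=1, k=2)
print(routes, active, vanilla_active_set(scores, k=2))
```

In this toy batch, ordinary top-2 routing touches three experts, while the sketch touches two: every token still gets its single best expert, and the second slot is filled from experts that some other token already required.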

Abstract

An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, in which the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even for moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results use batch-aware routing in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach reduces MoE layer decode latency by $39\%$ and $15\%$, respectively.
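
To make the memory-bound argument concrete, the following sketch estimates how many distinct experts a decode batch touches. It rests on assumptions not stated in the abstract: routing is treated as uniform and independent across tokens, and the configuration of 128 experts with 8 active per token is chosen for illustration. Under those assumptions, an expert is skipped by one token with probability $1 - k/N$, so the expected number of activated experts is $N\,(1 - (1 - k/N)^B)$.

```python
def expected_active_experts(num_experts, top_k, batch_size):
    """Expected distinct experts hit by a batch, assuming uniform,
    independent routing (an idealization, not a measured quantity)."""
    p_unused = (1.0 - top_k / num_experts) ** batch_size
    return num_experts * (1.0 - p_unused)


def rough_tokens_per_active_expert(num_experts, top_k, batch_size):
    """Total token-expert assignments divided by the expected active count.
    A rough ratio of expectations, used only for intuition."""
    total_assignments = batch_size * top_k
    return total_assignments / expected_active_experts(
        num_experts, top_k, batch_size
    )


# Illustrative configuration (assumed): 128 experts, 8 per token, batch 16.
n_active = expected_active_experts(128, 8, 16)
load = rough_tokens_per_active_expert(128, 8, 16)
print(f"expected active experts: {n_active:.1f}")
print(f"rough tokens per active expert: {load:.2f}")
```

Under these assumptions, a batch of 16 tokens touches roughly 82 of the 128 experts, and each loaded expert serves fewer than two tokens on average, whereas a dense layer would apply its weights to all 16 tokens. Most of the cost is therefore spent loading expert weights, which is why shrinking the active set reduces latency.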

Paper Structure

This paper contains 48 sections, 4 equations, 9 figures, 10 tables, 2 algorithms.

Figures (9)

  • Figure 1: Mean MoE latency as a function of the number of activated experts within a decode batch. The average is computed over all layers and decode steps across a GPQA evaluation of the vanilla Qwen3-30B-A3B model.
  • Figure 2: The y-axis shows the cross-entropy delta relative to the baseline (lower left is better). The two types of dots correspond to the Pareto frontiers of pruned and OEA experiments at batch size $B=16$. OEA consistently performs better.
  • Figure 3: The y-axis shows the cross-entropy delta relative to the baseline (lower left is better). The two types of dots correspond to the Pareto frontiers of simplified OEA and of the remaining experiments at batch size $B=16$. Simplified OEA performs comparably to the best hyperparameter choices.
  • Figure 4: Mean MoE latency as a function of the number of activated experts within a decode batch. The average is computed over all layers and decode steps across a GPQA evaluation of the vanilla Qwen3-235B-A22B model (under a tensor parallel degree of $8$).
  • Figure 5: The y-axis shows the cross-entropy delta relative to the baseline (lower left is better). The two types of dots correspond to the Pareto frontiers of pruned and OEA experiments across all batch sizes $B$. OEA consistently performs better.
  • ...and 4 more figures