Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens

Yanpeng Yu, Haiyue Ma, Krish Agarwal, Nicolai Oswald, Qijing Huang, Hugo Linsenmaier, Chunhui Mei, Ritchie Zhao, Ritika Borkar, Bita Darvish Rouhani, David Nellans, Ronny Krashinsky, Anurag Khandelwal

TL;DR

The paper identifies that token-balanced EP routing, effective in compute-bound regimes, can harm MoE decode performance when decoding is memory-bound due to inflated activated-expert counts. It introduces METRO, a token-routing algorithm that minimizes the number of activated experts per GPU, coupled with an all-gather scheme to share global top-k knowledge, achieving near-optimal routing with low overhead. Evaluations on real systems and a simulator show METRO reduces decode latency by up to 22% and increases total token throughput by up to 21%, with up to 4.11x gains in decode throughput at fixed SLOs. The work demonstrates that memory-aware routing can meaningfully improve end-to-end MoE serving, particularly in memory-bound decode phases, and discusses broader applicability to disaggregated deployments and future hardware.

Abstract

Expert Parallelism (EP) permits Mixture of Experts (MoE) models to scale beyond a single GPU. To address load imbalance across GPUs in EP, existing approaches aim to balance the number of tokens each GPU processes. Surprisingly, we find that this objective degrades performance rather than improving it when processing is memory-bound, a common occurrence in MoE serving, especially in the decode phase. Our analysis reveals that balancing the number of tokens processed per GPU increases the number of activated experts, exacerbating memory pressure in the memory-bound regime. We propose Minimum Expert Token ROuting (METRO), a novel token-routing algorithm for high-performance expert-parallel MoE serving in the memory-bound regime that balances the number of activated experts per GPU rather than token counts. METRO achieves near-optimal routing quality with minimal computational overhead by jointly optimizing algorithmic efficiency and leveraging the GPU's parallel processing power. To guarantee routing quality, METRO also employs a novel allGather scheme to gather global top-k knowledge, which has minimal overhead compared to conventional allToAll. Our evaluation of METRO against EPLB on both real systems (vLLM over 8 A100 GPUs) and a proprietary simulator (8-16 B200 GPUs) shows that METRO reduces decode latency by 11-22% and improves total token throughput by 3-21% for Qwen3 and DeepSeek-V3 serving, where prefill and decode phases are co-deployed. In addition, by trading latency headroom for throughput, METRO improves decode throughput by up to 4.11x over EPLB at a fixed decode SLO.
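The abstract's central claim, that balancing tokens can inflate the number of activated experts per GPU, can be illustrated with a toy example. The sketch below is not METRO: `expert_aware_routing` is a hypothetical greedy stand-in, and the setup (2 GPUs, 4 experts each replicated on both GPUs, top-1 routing, 4 tokens per expert) is an assumption chosen for clarity.

```python
from collections import defaultdict

def token_balanced_routing(token_experts, replicas):
    """Spread each expert's tokens round-robin over all of its replicas."""
    next_slot = defaultdict(int)
    assignment = []
    for expert in token_experts:
        gpus = replicas[expert]
        assignment.append((expert, gpus[next_slot[expert] % len(gpus)]))
        next_slot[expert] += 1
    return assignment

def expert_aware_routing(token_experts, replicas):
    """Greedy stand-in (not METRO itself): pin each expert to one replica,
    choosing the GPU that currently has the fewest activated experts."""
    activated = defaultdict(set)
    chosen = {}
    for expert in sorted(set(token_experts)):
        gpu = min(replicas[expert], key=lambda g: (len(activated[g]), g))
        chosen[expert] = gpu
        activated[gpu].add(expert)
    return [(expert, chosen[expert]) for expert in token_experts]

def max_activated_experts(assignment):
    """Largest number of distinct experts any single GPU has to load."""
    per_gpu = defaultdict(set)
    for expert, gpu in assignment:
        per_gpu[gpu].add(expert)
    return max(len(experts) for experts in per_gpu.values())

def tokens_per_gpu(assignment):
    """Number of tokens routed to each GPU."""
    counts = defaultdict(int)
    for _, gpu in assignment:
        counts[gpu] += 1
    return dict(counts)

# Assumed toy workload: 16 tokens, 4 per expert, every expert on both GPUs.
token_experts = [0, 1, 2, 3] * 4
replicas = {expert: [0, 1] for expert in range(4)}

balanced = token_balanced_routing(token_experts, replicas)
aware = expert_aware_routing(token_experts, replicas)

print(max_activated_experts(balanced), tokens_per_gpu(balanced))
print(max_activated_experts(aware), tokens_per_gpu(aware))
```

In this toy case both policies place 8 tokens on each GPU, but the round-robin policy activates all 4 experts on every GPU while the greedy policy activates only 2. In the memory-bound regime, where runtime tracks the expert weights that must be read, that is the difference the abstract describes.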

Paper Structure

This paper contains 21 sections, 1 theorem, 4 equations, 13 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Any feasible solution to min-exp-routing either routes each expert's tokens to only one replica of that expert, or can be mapped to a solution that does so without increasing the objective value.
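One way to see why such a mapping exists is sketched below. It assumes the objective is the maximum number of activated experts on any GPU (the page does not spell out the objective of min-exp-routing) and ignores any capacity constraints. The helper names are hypothetical. Consolidating each expert's tokens onto one GPU that already serves it can only shrink each GPU's set of activated experts.

```python
from collections import defaultdict

def max_activated_experts(assignment):
    """Assumed objective: the largest number of distinct experts on any GPU."""
    per_gpu = defaultdict(set)
    for expert, gpu in assignment:
        per_gpu[gpu].add(expert)
    return max(len(experts) for experts in per_gpu.values())

def consolidate(assignment):
    """Re-route every token of an expert to the first GPU that expert used."""
    first_gpu = {}
    for expert, gpu in assignment:
        first_gpu.setdefault(expert, gpu)
    return [(expert, first_gpu[expert]) for expert, _ in assignment]

# Hypothetical routing that splits experts 0 and 1 across both GPUs.
split = [(0, 0), (0, 1), (1, 1), (1, 0), (2, 0)]
single = consolidate(split)

print(max_activated_experts(split), max_activated_experts(single))
```

After consolidation, every GPU's activated set is a subset of what it was before, so the assumed objective cannot increase.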

Figures (13)

  • Figure 1: METRO achieves universal (up to $22\%$) performance improvement on both decode latency and total token throughput (prefill-decode co-deployed) over EPLB's token routing across models, datasets, and hardware setups. Results are from both real system evaluation and simulation (see the end-to-end performance section). Replication ratio: $50\%$. Placement algorithm: EPLB.
  • Figure 2: Expert-parallel MoE inference workflow with expert placement and replication, as well as token routing -- the algorithm to dynamically route tokens to expert replicas.
  • Figure 3: DeepSeek-V3 and Qwen3-30B attainable operational intensities vs. the FLOPs/byte ratios of H100 and B200 (see the section on the memory-bound regime). The former is two orders of magnitude lower than the latter with batch sizes smaller than $64$ tokens, and $47\%$ to $3.0\times$ lower with a batch size of $1024$ tokens.
  • Figure 4: Example: balancing tokens for token routing doubles the activated experts per GPU and the theoretical runtime in the memory-bound regime, compared to the hypothetical ideal token routing (see the section on balancing experts).
  • Figure 5: The performance impact of EPLB on prefill latency (a), decode latency (b), overall token throughput (c), and maximum number of activated experts across GPUs per decode batch (d) for Qwen3-30B on vLLM (see the section on balancing experts). Context length: $8K$. Dataset: InstructCoder. EPLB reduces prefill latency by $17\%$ with batch size $32$, but inflates the number of activated experts by $30\%$ with $1.5\times$ replication. As a result, the decode latency increases by $14\%$ and the overall token throughput decreases by $10\%$ with $1.5\times$ replication. We also find that the prefill phase can be in the memory-bound regime when the batch size is small (i.e., $8$ and $16$ requests per GPU), where EPLB cannot improve its performance.
  • ...and 8 more figures
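The roofline comparison behind Figure 3 can be approximated with back-of-the-envelope arithmetic. The sketch below is a simplification, not the paper's model. It assumes tokens spread uniformly over experts, counts only expert weight traffic (ignoring activations and attention), and uses illustrative round numbers for the hardware rather than vendor specifications. The function names and all parameter values are assumptions.

```python
def expert_operational_intensity(batch_tokens, top_k, num_experts, bytes_per_param):
    """Approximate FLOPs per byte for one expert's FFN weights.

    Each token routed to an expert costs about 2 FLOPs per weight
    (multiply and add), and each weight is read once per batch.
    """
    tokens_per_expert = batch_tokens * top_k / num_experts  # uniform routing assumed
    return 2.0 * tokens_per_expert / bytes_per_param

def ridge_point(peak_flops, mem_bandwidth):
    """FLOPs/byte above which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth

# Assumed example: 128 experts, top-8 routing, 2-byte weights, 64-token batch.
intensity = expert_operational_intensity(
    batch_tokens=64, top_k=8, num_experts=128, bytes_per_param=2
)

# Illustrative round hardware numbers (not vendor specs): 1e15 FLOP/s, 4e12 B/s.
ridge = ridge_point(peak_flops=1e15, mem_bandwidth=4e12)

print(intensity, ridge, intensity < ridge)
```

Under these assumptions each expert sees only a handful of tokens per batch, so its intensity sits far below the ridge point. Activating more experts then adds weight reads without adding useful compute.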

Theorems & Definitions (1)

  • Lemma 1