Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling

Yan Li, Pengfei Zheng, Shuang Chen, Zewei Xu, Yuanhao Lai, Yunfei Du, Zhengang Wang

TL;DR

MoE inference is hampered by heavy EP (expert parallelism) communication in multi-device setups, which limits throughput under latency constraints. The authors propose Speculative MoE (s-MoE), which combines online Speculative Token Shuffling (s-TS) and offline Speculative Expert Grouping (s-EG) to forecast token-expert routing and co-dispatch tokens and experts, replacing costly all-to-all EP traffic with more local, coordinated communication. They formulate token-expert affinity and inter-layer device affinity, solve a balanced token-expert co-clustering problem via a CEO-based algorithm, and implement optimized kernels (SRS/SAG) and de-duplication to realize gains on top of DeepSpeed-MoE and SGLang. Across two testbeds (fast homogeneous and slow heterogeneous interconnects) and two MoE models, s-MoE achieves up to 4.3× end-to-end throughput improvement under tight SLOs, with s-TS and s-EG contributing complementary gains. Its integration into both DeepSpeed-MoE and SGLang indicates that the approach is framework-agnostic.

Abstract

MoE (Mixture of Experts) prevails as a neural architecture that can scale modern transformer-based LLMs (Large Language Models) to unprecedented scales. Nevertheless, large MoEs' heavy demands on computing power, memory capacity and memory bandwidth make scalable serving a fundamental challenge, and efficient parallel inference has become a requisite to attain adequate throughput under latency constraints. DeepSpeed-MoE, one state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm including EP (Expert Parallelism), TP (Tensor Parallelism) and DP (Data Parallelism). However, our analysis shows DeepSpeed-MoE's inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives to route token activations. Our work aims to boost DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE. Speculative MoE has two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens' expert routing paths and pre-schedule tokens and experts across devices to losslessly trim EP's communication volume. Besides DeepSpeed-MoE, we also build Speculative MoE into SGLang, a prevailing MoE inference engine. Experiments show Speculative MoE can significantly boost state-of-the-art MoE inference frameworks on both fast homogeneous and slow heterogeneous interconnects.
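
To make the idea of pre-scheduling tokens concrete, the following is a minimal toy sketch, not the authors' algorithm or kernels. It counts how many token-to-expert activations cross a device boundary under a fixed token placement, then again after each token is moved to the device hosting most of its predicted experts. All names (`remote_dispatches`, `place_by_prediction`) and all values are hypothetical, the prediction is assumed to be perfect, and load balancing across devices is ignored.

```python
from collections import Counter


def remote_dispatches(token_device, token_experts, expert_device):
    """Count (token, expert) activations that cross a device boundary."""
    return sum(
        1
        for tok, experts in token_experts.items()
        for e in experts
        if expert_device[e] != token_device[tok]
    )


def place_by_prediction(predicted_experts, expert_device):
    """Put each token on the device that hosts most of its predicted experts.

    Hypothetical helper: a greedy stand-in for speculative token placement.
    It does not balance the number of tokens per device.
    """
    placement = {}
    for tok, experts in predicted_experts.items():
        votes = Counter(expert_device[e] for e in experts)
        placement[tok] = votes.most_common(1)[0][0]
    return placement


# Toy setup (hypothetical): 4 experts on 2 devices, top-2 routing per token.
expert_device = {0: 0, 1: 0, 2: 1, 3: 1}
token_experts = {"t0": [2, 3], "t1": [0, 1], "t2": [2, 3], "t3": [0, 2]}

# Baseline: tokens sit on arbitrary devices, unaware of their routing.
baseline = {"t0": 0, "t1": 1, "t2": 0, "t3": 1}
baseline_remote = remote_dispatches(baseline, token_experts, expert_device)

# Speculative placement: assume the predicted routing equals the actual one.
speculative = place_by_prediction(token_experts, expert_device)
speculative_remote = remote_dispatches(speculative, token_experts, expert_device)

print(baseline_remote, speculative_remote)  # 7 1
```

In this toy case the only remaining cross-device activation comes from `t3`, whose two experts live on different devices. Reducing that kind of dispersed activation is what co-locating experts that tend to be activated together would address.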

Paper Structure

This paper contains 25 sections, 1 equation, 8 figures, 1 table, 2 algorithms.

Figures (8)

  • Figure 1: Latency breakdown for DeepSpeed-MoE inference over a single MoE layer. Hardware: 8-GPU (96GB) server with fast inter-GPU network (900GB/s); Model: DeepSeek-V2 236B; Dataset: LongBench; batch size × sequence length: 2K-16K.
  • Figure 2: Example of s-MoE. Compared with DS-MoE (5 tokens in EP shuffling), speculative token shuffling (s-TS) replaces DS-MoE's allreduce with a customized shuffled-reduce-scatter, which reduces EP's shuffling to 3 tokens by pre-collocating tokens with the speculated experts they will be routed to. Furthermore, speculative expert grouping (s-EG) co-groups semantically similar experts and avoids tokens' dispersed activation, further reducing EP shuffling to 1 token.
  • Figure 3: s-MoE workflow
  • Figure 4: Inference throughput under TTFT, TPOT and p90-TBT latency constraints. DS-MoE: DeepSpeed-MoE. s-MoE: Speculative MoE. s-TS: Speculative Token Shuffling. s-EG: Speculative Expert Grouping. Note that we tune the parallel configurations and report the one under which DeepSpeed-MoE performs best.
  • Figure 5: Local activation rate against overall EP overhead.
  • ...and 3 more figures