MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang

TL;DR

This work challenges the view that speculative decoding (SD) cannot accelerate MoE inference by showing substantial speedups at moderate batch sizes, especially as MoE sparsity increases. It introduces a metric, target efficiency, to diagnose system bottlenecks that acceptance rates alone do not capture, and develops a practical SD speedup model based on roofline effects, activated-expert counts, and per-expert load. Theoretical analyses align with GPU measurements, yielding up to 2.29x end-to-end speedup on Qwen2-57B-A14B-Instruct and supporting applicability to private serving and memory-constrained deployments. The results provide a lossless acceleration pathway for MoEs and a framework for understanding SD performance across workload and architecture variations.

Abstract

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser, which is the prevailing trend in MoE designs, the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand the tradeoffs involved in SD, we develop a reliable model based on theoretical analysis. While current SD research primarily focuses on improving the acceptance rates of algorithms, changes in workload and model architecture can still degrade SD acceleration even when acceptance rates are high. To address this limitation, we introduce a new metric, 'target efficiency', that characterizes these effects, helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, where existing solutions struggle, this work unveils a new perspective for speeding up MoE inference. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.
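
The claim that a high acceptance rate does not guarantee a speedup can be illustrated with a small sketch. It is not the paper's model: it uses the standard idealized estimate from the SD literature, which assumes each drafted token is accepted independently with probability `alpha`. The function names and the cost parameters `draft_cost` and `verify_cost` (both relative to one plain decoding step of the target model) are hypothetical, chosen only for illustration.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens produced by one draft-then-verify round.

    Assumes i.i.d. acceptance with probability alpha and draft length gamma;
    the verification step always contributes one token of its own.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def idealized_speedup(alpha: float, gamma: int,
                      draft_cost: float, verify_cost: float) -> float:
    """Idealized SD speedup over plain autoregressive decoding.

    draft_cost:  cost of one draft-model step (hypothetical parameter),
                 relative to one plain target decoding step.
    verify_cost: cost of verifying gamma + 1 tokens with the target model
                 (hypothetical parameter), in the same relative units.
    """
    round_cost = gamma * draft_cost + verify_cost
    return expected_tokens_per_round(alpha, gamma) / round_cost


# Same acceptance rate, different verification cost: the speedup shrinks
# as verifying several tokens becomes more expensive than one decoding step.
cheap_verify = idealized_speedup(alpha=0.8, gamma=4, draft_cost=0.05, verify_cost=1.0)
costly_verify = idealized_speedup(alpha=0.8, gamma=4, draft_cost=0.05, verify_cost=3.0)
```

In this sketch the acceptance rate is held fixed, so any difference between `cheap_verify` and `costly_verify` comes purely from the verification cost. This is the kind of workload- and architecture-dependent effect that the abstract says acceptance rates alone do not capture.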

Paper Structure

This paper contains 28 sections, 12 equations, 28 figures, 2 tables, 1 algorithm.

Figures (28)

  • Figure 1: Activation status and workload of experts. (a) and (b): Comparison between theoretical and actual number of activated experts $N(t)$ on different datasets. (a) is for Deepseek-V2-Lite-Chat ($\rho=6/62$) and (b) is for Qwen1.5-MoE-Chat ($\rho=4/60$). (c): Normalized number of tokens to process per expert ($\overline{T_{exp}}$) versus MoE sparsity ($\rho$) for given input token count $T$.
  • Figure 2: SD speedup (left y-axis) as a function of batch size and corresponding target efficiency values (right y-axis). Across different hardware platforms and MoE models, SD speedup first increases and then decreases, verifying our theoretical predictions. The target efficiency shows consistent trends with final speedup, validating its effectiveness.
  • Figure 3: Comparison of target efficiency: MoE vs dense model.
  • Figure 4: Comparison between GPU results and our modeling for Qwen2-57B-A14B-Instruct with varying sparsity $\rho$ and draft length $\gamma$.
  • Figure 5: SD speedup trends across more settings with individual runs and averages shown.
  • ...and 23 more figures
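
The quantities in the Figure 1 caption can be sketched numerically. The sketch below assumes uniform, independent top-K routing, which is a simplification and may differ from the paper's exact derivation. It also assumes the sparsity is `rho = top_k / num_experts`. The function names, and the choice to define per-expert load as total token-to-expert assignments divided by the expected number of activated experts, are illustrative assumptions; the paper's normalization of $\overline{T_{exp}}$ may differ.

```python
def expected_activated_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of experts hit by at least one of num_tokens tokens.

    Assumption: each token picks top_k distinct experts uniformly at random,
    independently of the other tokens, so a given expert is missed by one
    token with probability 1 - rho, where rho = top_k / num_experts.
    """
    rho = top_k / num_experts
    return num_experts * (1.0 - (1.0 - rho) ** num_tokens)


def mean_tokens_per_activated_expert(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Average load per activated expert (illustrative definition).

    Total token-to-expert assignments (num_tokens * top_k) divided by the
    expected number of activated experts.
    """
    if num_tokens == 0:
        return 0.0
    active = expected_activated_experts(num_experts, top_k, num_tokens)
    return num_tokens * top_k / active


# Example with the sparsity quoted for Qwen1.5-MoE-Chat in the caption (rho = 4/60).
for t in (1, 8, 32, 128):
    n_active = expected_activated_experts(60, 4, t)
    load = mean_tokens_per_activated_expert(60, 4, t)
    print(f"T={t:4d}  activated experts ~ {n_active:6.2f}  tokens per expert ~ {load:6.2f}")
```

Under these assumptions the number of activated experts saturates at the total expert count as the token count grows, after which additional tokens only increase the per-expert load.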