SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference

Liangkun Chen, Zijian Wen, Tian Wu, Xiaoxi Zhang, Chuan Wu

TL;DR

This paper tackles the memory and bandwidth bottlenecks that arise when speculative decoding (SD) is applied to MoE inference on resource-constrained hardware. It introduces SP-MoE, an SD-aware expert-offloading framework that combines three components: a cross-model expert predictor that drives prefetching during the drafting stage, a cutoff-layer policy that bounds prefetch depth, and a pipelined asynchronous prefetching runtime that uses batched I/O to hide loading latency. The key contributions are the SD-aware prefetching strategy, the just-in-time cutoff mechanism, and the continuous prefetching pipeline, which together yield speedups of up to 3.5x in time per output token (TPOT) across diverse MoE models and datasets. The gains hold under memory constraints and across varying hardware, which makes SD-accelerated MoE LLMs practical to deploy on GPUs ranging from consumer to enterprise class.

Abstract

The Mixture-of-Experts (MoE) architecture has been widely adopted in large language models (LLMs) to reduce computation cost through model sparsity. Employing speculative decoding (SD) can further accelerate MoE inference by drafting multiple tokens per step and verifying them in parallel. However, combining MoE with SD inflates GPU memory usage and aggravates CPU-GPU bandwidth contention during multi-token verification. Existing MoE offloading systems are SD-agnostic and do not address this bottleneck. We present SP-MoE, the first SD-aware expert-offloading and compute-communication pipelining framework. SP-MoE introduces: (1) speculative expert prefetching that exploits structural correspondence between the draft and target models to prefetch likely experts ahead of verification; (2) a cutoff-layer policy that bounds per-layer prefetch depth based on empirical profiles and an analytical latency model, guaranteeing just-in-time availability without overfetch; and (3) a pipelined runtime with asynchronous prefetch threads and batched I/O to hide loading latency. Extensive experiments demonstrate that SP-MoE achieves a 1.07x to 3.5x speedup in time per output token (TPOT) over state-of-the-art methods across diverse datasets, environments, and MoE-based models.
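
To make the cutoff-layer idea in the abstract concrete, here is a minimal sketch under loudly stated assumptions. It is not the paper's analytical latency model, which is not reproduced on this page. It assumes that prefetching overlaps only with the drafting stage, that every expert takes the same time to load, and that a fixed number of experts is predicted per layer. The names `cutoff_layer`, `plan_prefetch`, and all parameters are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Set, Tuple


def cutoff_layer(draft_time_ms: float, expert_load_ms: float,
                 experts_per_layer: int, num_layers: int) -> int:
    """Return how many leading target-model layers can have their
    predicted experts loaded while the draft model is still running.

    Assumed toy model (not from the paper): loads are serialized and
    each expert costs `expert_load_ms`.
    """
    if expert_load_ms <= 0 or experts_per_layer <= 0:
        raise ValueError("load time and experts per layer must be positive")
    per_layer_ms = expert_load_ms * experts_per_layer
    # Layers beyond this depth would not arrive in time, so skip them.
    return min(num_layers, int(draft_time_ms // per_layer_ms))


def plan_prefetch(predicted: Dict[int, List[int]], cutoff: int,
                  gpu_cache: Set[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """List the (layer, expert) pairs to prefetch: predicted experts in
    layers below the cutoff that are not already resident on the GPU."""
    plan: List[Tuple[int, int]] = []
    for layer in sorted(predicted):
        if layer >= cutoff:
            break  # bound prefetch depth to avoid overfetching
        for expert in predicted[layer]:
            if (layer, expert) not in gpu_cache:
                plan.append((layer, expert))
    return plan


# Hypothetical numbers: 20 ms of drafting, 2.5 ms per expert load,
# 2 predicted experts per layer, 32 layers in the target model.
cutoff = cutoff_layer(20.0, 2.5, 2, 32)
plan = plan_prefetch({0: [1, 3], 1: [2, 5], 4: [0, 7]}, cutoff, {(0, 1)})
```

With these made-up numbers the drafting window covers four layers, so the experts predicted for layer 4 are left to on-demand loading, and the expert already cached for layer 0 is skipped. A real system would issue the resulting plan as batched asynchronous transfers rather than a plain list.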

Paper Structure

This paper contains 23 sections, 2 equations, 14 figures, 3 tables, 2 algorithms.

Figures (14)

  • Figure 1: MoE-based LLM with speculative decoding
  • Figure 2: Observation I: Neighboring draft tokens exhibit overlapping expert activations, motivating the notion of critical experts. Their strong predictability, especially under gating-based strategies, is key to reducing expert loading overhead.
  • Figure 3: Observation II: Prefetching for lighter layers yields lower eviction rates.
  • Figure 4: Observation III: Latency distribution of a single decode iteration across three models.
  • Figure 5: Pipelines of four representative mechanisms that offload expert parameters. The number of single quotation marks indicates the layer number.
  • ...and 9 more figures