MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts

Wenfeng Wang, Jiacheng Liu, Xiaofeng Hou, Xinfeng Xia, Peng Tang, Mingxuan Zhang, Chao Li, Minyi Guo

TL;DR

MoE-SpeQ tackles the data-movement bottleneck in large Mixture-of-Experts inference by combining a small, on-device INT4 draft model with a lookahead-driven Expert Scheduler and an adaptive Speculative Governor guided by an Amortization Roofline Model. The system prefetches and co-manages expert parameters to hide PCIe latency behind productive computation, achieving up to 2.34x end-to-end speedups and substantial memory savings through parameter and KV-cache sharing plus a fused MoE draft kernel. Key innovations include the Expert Lookahead Buffer (ELB), hierarchical cache priming and prefetching phases, and online optimization that respects latency SLOs. The approach demonstrates strong performance across multiple MoE architectures and memory budgets, enabling practical MoE inference on commodity hardware and providing a principled framework for data-dependent memory management in constrained environments.

Abstract

The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus: because expert selection is data-dependent, these synchronous transfers sit directly on the critical path of execution, crippling performance. This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to prefetch these experts from host memory, effectively overlapping the expensive I/O with useful computation and hiding the latency from the critical path. To maximize performance, an adaptive governor, guided by an Amortization Roofline Model, dynamically tunes the speculation strategy to the underlying hardware. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves up to a 2.34x speedup over the state-of-the-art offloading framework. Our work establishes a new, principled approach for managing data-dependent memory access in resource-limited environments, making MoE inference more accessible on commodity hardware.
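The prefetching idea in the abstract can be illustrated with a toy simulation. The sketch below is not the MoE-SpeQ implementation: `ExpertCache`, `count_stalls`, the LRU eviction policy, and the assumption that every prefetch overlaps perfectly with earlier computation are all hypothetical simplifications introduced here. It only counts how many expert transfers land on the critical path with and without draft-model predictions.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU set of expert IDs resident in device memory (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()

    def load(self, expert_id):
        """Make an expert resident; return True if a host-to-device copy was needed."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # refresh recency on a hit
            return False
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used expert
        self.resident[expert_id] = True
        return True


def count_stalls(true_experts, predicted_experts=None, capacity=4):
    """Count transfers that block execution.

    true_experts: per-token lists of expert IDs the target model routes to.
    predicted_experts: per-token lists guessed by a draft model,
        or None to model purely on-demand loading.
    """
    cache = ExpertCache(capacity)
    stalls = 0
    for step, needed in enumerate(true_experts):
        if predicted_experts is not None:
            # Assumption: prefetches are fully hidden behind earlier compute,
            # so they never count as stalls.
            for expert_id in predicted_experts[step]:
                cache.load(expert_id)
        for expert_id in needed:
            if cache.load(expert_id):
                stalls += 1  # miss: the transfer sits on the critical path
    return stalls


true_routes = [[0, 1], [2, 3], [0, 4], [5, 1]]
draft_guess = [[0, 1], [2, 3], [0, 6], [5, 1]]  # one wrong guess at step 2

print(count_stalls(true_routes))               # on-demand loading
print(count_stalls(true_routes, draft_guess))  # with lookahead prefetching
```

In this toy setting a single misprediction still costs one blocking transfer, which hints at why the paper pairs prefetching with an adaptive governor rather than assuming the draft model is always right.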

Paper Structure

This paper contains 41 sections, 5 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Comparison of execution timelines. (a) The baseline is dominated by I/O stalls. (b) Our approach utilizes the initial I/O latency to perform speculative draft generation, effectively hiding latency and maximizing GPU utilization.
  • Figure 2: Speculative decoding in MoE.
  • Figure 3: Performance comparison of decoding timelines. Speculative decoding provides a clear benefit for dense models by amortizing verification costs. For MoE models, however, the verification overhead becomes substantial, leading to performance degradation.
  • Figure 4: Latency breakdown for an inference step using the offloading mechanism in Transformers on an A100-PCIE-40G. GPU computation accounts for less than 15% of the total time, with the vast majority spent stalled on PCIe transfers.
  • Figure 5: Expert activation in Qwen-1.5MoE is highly diverse and non-uniform, reflected in (a) unbalanced activation counts per expert, and (b) consistently high activation entropy across layers.
  • ...and 8 more figures