Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding

Zhibin Wang, Zhonghui Zhang, Yuhang Zhou, Zibo Wang, Mo Zhou, Peng Jiang, Weilin Cai, Chengying Huan, Rong Gu, Sheng Zhong, Chen Tian

TL;DR

This work tackles GPU memory bottlenecks in large MoE inference by introducing SpecMoEOff, a system that uses speculative decoding to increase active workload on both CPU and GPU during decoding. It couples a target-model engine with a draft-model engine, augmented by a CPU-based chunked attention verifier, memory-aware draft execution, and an optimizer that blends convex optimization with profiling-based estimation to select hyperparameters. The approach is supported by roofline analysis showing memory-transfer bottlenecks in existing offloading, and experimental results demonstrate up to 2.5x improvements in decode throughput over state-of-the-art MoE offloading methods. The combination of CPU-GPU cooperation, memory-conscious drafting, and automated hyperparameter tuning enables substantially higher throughput for large-scale MoE decoding in practical hardware configurations.

Abstract

Recent advances in Mixture-of-Experts (MoE) models have significantly increased both their parameter scale and their performance. Numerous offloading techniques have been proposed to address the GPU memory limitations of MoE inference. However, because of the I/O bottleneck and the sparse computation of MoE models, existing offloading techniques still suffer from low hardware utilization. To fully utilize the hardware, we propose SpecMoEOff, which employs speculative decoding to enlarge the workload of each expert. SpecMoEOff orchestrates the GPU and CPU based on both theoretical and empirical roofline analysis. In addition, we develop a dedicated CPU chunked attention verification kernel that adapts speculative decoding to offloading scenarios and minimizes the additional overhead introduced by draft models. SpecMoEOff further integrates an optimizer that automatically tunes the hyperparameters of speculative decoding for the given hardware and workload. Experimental results show that SpecMoEOff achieves up to 2.5x decode throughput improvement over state-of-the-art MoE offloading techniques.
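
The roofline argument in the abstract can be illustrated with a minimal sketch. It is not SpecMoEOff's analysis or code: it uses the classic single-level roofline bound, assumes a three-matrix gated expert FFN with Mixtral 8x7B-like dimensions, counts only the expert weights as transferred bytes, and plugs in hypothetical hardware numbers (`PEAK_FLOPS`, `LINK_BANDWIDTH`). The function names and the draft length `k` are likewise assumptions made for illustration.

```python
def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Classic roofline bound: throughput is capped by compute or by data movement."""
    return min(peak_flops, bandwidth * intensity)


def expert_intensity(tokens: int, d_model: int, d_ff: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of expert weights moved, for one expert FFN.

    Assumptions: three weight matrices of shape d_model x d_ff (gated FFN),
    2 FLOPs per multiply-accumulate, and only weight bytes are counted.
    """
    params = 3 * d_model * d_ff
    flops = 2 * params * tokens
    return flops / (params * bytes_per_param)


# Hypothetical hardware numbers, chosen only to make the arithmetic concrete.
PEAK_FLOPS = 100e12      # assumed GPU peak, FLOP/s
LINK_BANDWIDTH = 32e9    # assumed host-to-GPU bandwidth, bytes/s

d_model, d_ff = 4096, 14336   # Mixtral 8x7B-like expert dimensions
tokens_plain = 4              # tokens routed to one expert per decode step (assumed)
k = 4                         # draft tokens verified per sequence (assumed)
tokens_spec = tokens_plain * (k + 1)  # simplification: uniform routing of draft tokens

plain = attainable_flops(PEAK_FLOPS, LINK_BANDWIDTH,
                         expert_intensity(tokens_plain, d_model, d_ff))
spec = attainable_flops(PEAK_FLOPS, LINK_BANDWIDTH,
                        expert_intensity(tokens_spec, d_model, d_ff))
print(f"attainable FLOP/s without drafting: {plain:.3g}")
print(f"attainable FLOP/s with drafting:    {spec:.3g}")
```

Under these assumptions both cases stay on the bandwidth-limited side of the roofline, so the bound grows in proportion to the number of tokens each transferred expert processes. Whether that translates into end-to-end throughput depends on the draft acceptance rate and the draft model's own cost, which this sketch ignores.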

Paper Structure

This paper contains 22 sections, 2 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: MoE model, offloading, and speculative decoding.
  • Figure 2: Proportion of cost of three kinds of layers.
  • Figure 3: Hierarchical Roofline Models for Mixtral 8x7B in large batch size on A30 and 4090D instances.
  • Figure 4: System Architecture of SpecMoEOff.
  • Figure 5: SpecMoEOff execution pipeline.
  • ...and 8 more figures