DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference

Yujie Zhang, Shivam Aggarwal, Tulika Mitra

TL;DR

DAOP introduces a data-aware MoE inference engine that dynamically offloads non-dominant experts to the CPU and uses prediction-driven pre-calculation to hide CPU-GPU data-transfer latency. By exploiting sequence-specific activation patterns and enabling one-layer-ahead predictions, DAOP achieves parallel CPU-GPU execution, memory-efficient caching, and graceful degradation to preserve accuracy. Extensive experiments on Mixtral and Phi MoE models show substantial speedups (up to $8.20\times$) and improved energy efficiency over baselines, with minimal accuracy loss across diverse tasks. The approach offers a practical path to deploying large MoE models on memory-constrained devices without model fine-tuning, and the authors provide public code for broader adoption.

Abstract

Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
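
As a rough illustration of the per-sequence allocation idea described above, the sketch below counts how often each expert is activated for one input sequence and keeps the most frequently used experts of each layer on the GPU, leaving the rest to the CPU. It is not DAOP's actual implementation: the function names (`allocate_experts`, `device_for`), the trace format, and the simple "most-activated first" ranking are all assumptions made for illustration, and the pre-calculation and graceful degradation mechanisms are not modeled.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def allocate_experts(
    routing_trace: Iterable[Tuple[int, List[int]]],
    num_layers: int,
    num_experts: int,
    cache_ratio: float,
) -> Dict[int, Set[int]]:
    """Pick, per layer, which experts stay on the GPU for this sequence.

    routing_trace yields (layer, activated_expert_ids) pairs, one per token
    and layer (hypothetical format, e.g. recorded while reading the prompt).
    """
    # GPU-resident experts per layer implied by the expert cache ratio.
    slots = min(num_experts, max(1, math.floor(cache_ratio * num_experts)))

    # Count activations per layer for this particular sequence.
    counts = {layer: Counter() for layer in range(num_layers)}
    for layer, experts in routing_trace:
        counts[layer].update(experts)

    gpu_experts: Dict[int, Set[int]] = {}
    for layer in range(num_layers):
        # Most-activated experts first; ties broken by expert id.
        ranked = sorted(range(num_experts), key=lambda e: (-counts[layer][e], e))
        gpu_experts[layer] = set(ranked[:slots])
    return gpu_experts


def device_for(layer: int, expert: int, gpu_experts: Dict[int, Set[int]]) -> str:
    """Dominant experts run on the GPU; the others are offloaded to the CPU."""
    return "gpu" if expert in gpu_experts[layer] else "cpu"


# Toy trace: 3 tokens, 2 layers, 4 experts, top-2 routing per token.
trace = [
    (0, [0, 1]), (0, [0, 2]), (0, [0, 1]),
    (1, [3, 2]), (1, [3, 1]), (1, [3, 2]),
]
placement = allocate_experts(trace, num_layers=2, num_experts=4, cache_ratio=0.5)
```

With a cache ratio of 0.5 and four experts per layer, two experts per layer stay on the GPU in this toy setting; a different input sequence with a different activation pattern would yield a different placement, which is the sense in which the allocation is data-aware.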

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Average parameter distribution in Mixtral 8x7B (Jiang et al., 2024).
  • Figure 2: NVIDIA A6000 GPU specifications.
  • Figure 3: The decoder-only MoE-based LLM inference procedure with top-2 experts activated per token.
  • Figure 4: Layer-wise expert activation pattern of Mixtral 8x7B on dataset C4.
  • Figure 5: Layer-wise expert prediction accuracy for the Mixtral 8x7B model, one layer ahead, during the decode phase.
  • ...and 5 more figures