HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li

TL;DR

This work introduces HybriMoE, a hybrid CPU-GPU inference framework for efficient MoE execution on memory-constrained devices. It combines dynamic intra-layer scheduling, impact-driven prefetching, and score-aware caching to balance heterogeneous workloads, preload high-impact experts, and retain high-demand experts in cache. The approach yields an average speedup of 1.33× in prefill and 1.70× in decode across three MoE-based LLMs, validating its effectiveness on kTransformers and llama.cpp backends. By addressing unstable activation patterns and MoE structure complexity, HybriMoE enables more scalable and responsive MoE inference on edge-like platforms.

Abstract

The Mixture of Experts (MoE) architecture has demonstrated significant advantages because it increases model capacity without a proportional increase in computation. However, the large size of MoE models still imposes substantial memory demands, which usually require expert offloading on resource-constrained platforms and incur significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation and reduce expert loading overhead, but it faces two major challenges. On one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient. On the other hand, hybrid CPU-GPU scheduling for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, and other factors. To address these challenges, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to a state-of-the-art hybrid MoE inference framework. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.

Paper Structure

This paper contains 26 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Execution timeline of three scenarios. Expert computation time on the GPU remains constant, while CPU execution time increases linearly with workload. The balanced scheduling in (c) achieves improved utilization and reduces overall execution time.
  • Figure 2: An example of MoE architecture with shared and routed experts.
  • Figure 3: (a) Cumulative distribution (CDF) of activation frequency for neurons and experts. (b) Reuse probability of experts by score, suggesting cache optimization opportunities. (c) Expert workload distribution of DeepSeek in a prefill forward pass. (d) Latency of prefilling 128 tokens for Qwen2 (Q) and Mixtral (M), and of decoding 10 tokens for Mixtral, with three existing methods. (e) CPU vs. GPU time for varying numbers of experts at fixed load, with the CPU benefiting from overlapping computations. (f) CPU and GPU time across workload sizes.
  • Figure 4: Overview of HybriMoE.
  • Figure 5: An example of hybrid scheduling. The CPU computes the cached expert E while the GPU computes the uncached expert C to achieve better hardware utilization.
  • ...and 4 more figures
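
The cost model described in the Figure 1 caption (GPU time per expert roughly constant, CPU time growing linearly with workload) can be illustrated with a small scheduler. The following is a minimal sketch under stated assumptions, not the HybriMoE algorithm: the `Expert` class, the `schedule` function, the greedy rule, and all cost constants (`gpu_time`, `load_time`, `cpu_time_per_token`) are hypothetical, and weight loading is assumed to serialize with GPU compute rather than overlap with it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str
    tokens: int    # tokens routed to this expert in the current layer
    cached: bool   # True if the weights are already resident on the GPU


def schedule(experts, gpu_time=1.0, load_time=3.0, cpu_time_per_token=0.5):
    """Greedily assign each expert to the device that would finish it first.

    Assumed cost model (illustrative only):
      GPU: constant gpu_time per expert, plus load_time if it is not cached.
      CPU: cpu_time_per_token * tokens, i.e. linear in the workload.
    Returns (plan, makespan), where plan maps expert name to "gpu" or "cpu".
    """
    gpu_busy = 0.0
    cpu_busy = 0.0
    plan = {}
    # Place the heaviest experts first so the light ones can fill the gaps.
    for e in sorted(experts, key=lambda e: e.tokens, reverse=True):
        gpu_finish = gpu_busy + gpu_time + (0.0 if e.cached else load_time)
        cpu_finish = cpu_busy + cpu_time_per_token * e.tokens
        if gpu_finish <= cpu_finish:
            plan[e.name] = "gpu"
            gpu_busy = gpu_finish
        else:
            plan[e.name] = "cpu"
            cpu_busy = cpu_finish
    return plan, max(gpu_busy, cpu_busy)


experts = [
    Expert("A", tokens=8, cached=True),
    Expert("B", tokens=2, cached=False),
    Expert("C", tokens=1, cached=False),
    Expert("D", tokens=6, cached=True),
]
plan, makespan = schedule(experts)
print(plan, makespan)
```

With these assumed costs, the heavily loaded cached experts A and D go to the GPU and the lightly loaded uncached experts B and C go to the CPU, giving a makespan of 2.0. Running everything on the GPU would take 10.0 (two loads included) and running everything on the CPU would take 8.5, which is the kind of gap that balanced scheduling aims to close.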