HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Shuzhang Zhong; Yanfan Sun; Ling Liang; Runsheng Wang; Ru Huang; Meng Li

TL;DR

This work introduces HybriMoE, a hybrid CPU-GPU inference framework for efficient MoE execution on memory-constrained devices. It combines dynamic intra-layer scheduling, impact-driven prefetching, and score-aware caching to balance heterogeneous workloads, preload high-impact experts, and retain high-demand experts in cache. The approach yields an average speedup of 1.33× in prefill and 1.70× in decode across three MoE-based LLMs, validating its effectiveness on kTransformers and llama.cpp backends. By addressing unstable activation patterns and MoE structure complexity, HybriMoE enables more scalable and responsive MoE inference on edge-like platforms.

Abstract

The Mixture of Experts (MoE) architecture has demonstrated significant advantages because it increases model capacity without a proportional increase in computation. However, the large size of MoE models still introduces substantial memory demands, which usually require expert offloading on resource-constrained platforms and incur significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead, but it faces two major challenges. On one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient. On the other hand, the hybrid CPU-GPU schedule for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, and other factors. To address these challenges, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33$\times$ in the prefill stage and 1.70$\times$ in the decode stage compared to the state-of-the-art hybrid MoE inference framework. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
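The abstract names a score-based caching algorithm but does not spell out its scoring rule. As a rough illustration of the general idea only, the sketch below keeps a running score per expert and evicts the lowest-scoring resident expert on a miss. The class name `ScoreAwareExpertCache`, the exponentially decayed sum of router scores, and the `capacity` and `decay` parameters are all assumptions made for this example, not details taken from HybriMoE.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreAwareExpertCache:
    """Toy GPU expert cache that evicts the lowest-scoring resident expert.

    Hypothetical illustration; not the HybriMoE implementation.
    """

    capacity: int                                # experts that fit on the GPU (assumed)
    decay: float = 0.5                           # assumed weight given to past scores
    scores: dict = field(default_factory=dict)   # expert id -> running score
    resident: set = field(default_factory=set)   # expert ids currently cached

    def observe(self, router_scores):
        """Blend this step's router scores into the running scores."""
        for expert in set(self.scores) | set(router_scores):
            old = self.scores.get(expert, 0.0)
            self.scores[expert] = self.decay * old + router_scores.get(expert, 0.0)

    def access(self, expert):
        """Return True on a hit; on a miss, load the expert, evicting if full."""
        if expert in self.resident:
            return True
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the lowest running score.
            victim = min(self.resident, key=lambda e: self.scores.get(e, 0.0))
            self.resident.remove(victim)
        self.resident.add(expert)
        return False


cache = ScoreAwareExpertCache(capacity=2)
cache.observe({0: 0.6, 1: 0.3, 2: 0.1})
cache.access(0)                          # miss: expert 0 is loaded
cache.access(1)                          # miss: expert 1 is loaded
cache.observe({0: 0.1, 1: 0.2, 2: 0.7})
cache.access(2)                          # miss: expert 1 has the lowest score and is evicted
```

Unlike plain LRU, this policy can keep an expert that was not used in the most recent step if its accumulated score is still high, which is one plausible way to cope with unstable activation patterns.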
