Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems

En-Ming Huang, Li-Shang Lin, Chun-Yi Lee

TL;DR

This work tackles the memory bottleneck of MoE-based LLMs on consumer GPUs by introducing a CPU-GPU collaborative inference framework that caches a subset of MoE experts on the GPU and offloads non-cached computations to the CPU with asynchronous data transfers. It leverages patterns of expert reuse to drive an LRU-based caching strategy and employs a two-stream transfer mechanism to overlap computation and communication. Across Mixtral 8x7B and Phi3.5-MoE, the method achieves speedups of up to 4.4x, along with energy savings, without model modifications, demonstrating practical viability on memory-constrained hardware. The approach differs from the baselines by exploiting CPU multi-core parallelism and using GPU memory as a cache, making MoE inference more accessible on consumer-grade systems.
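The LRU-based expert caching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the class `ExpertLRUCache`, its `access` method, and the access trace are hypothetical names and data chosen for illustration, and the asynchronous weight copy is reduced to a bookkeeping update. A hit means the expert would run on the GPU; a miss means it would run on the CPU while its weights are brought into the cache for later tokens.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Tracks which expert ids are resident on the GPU (hypothetical helper)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._resident = OrderedDict()  # expert id -> None, least recent first
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Return 'gpu' on a cache hit and 'cpu' on a miss, updating the cache."""
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return "gpu"
        self.misses += 1
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # evict the least recently used
        # Assumption: this line stands in for the asynchronous weight transfer.
        self._resident[expert_id] = None
        return "cpu"

    def resident(self):
        """Expert ids currently cached, least recently used first."""
        return list(self._resident)


cache = ExpertLRUCache(capacity=2)
trace = [0, 3, 0, 3, 5, 3]  # made-up sequence of selected expert ids
placements = [cache.access(expert_id) for expert_id in trace]
print(placements, cache.hits, cache.misses)
```

With reuse-heavy traces like the one above, repeated selections of the same expert turn into hits, which is the behaviour the caching strategy relies on.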

Abstract

Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture of Experts (MoE) models provide an efficient solution through selective activation of parameter subsets, which reduces computation requirements. Despite this efficiency, state-of-the-art MoE models still require substantial memory beyond typical consumer GPU capacities. Traditional offloading methods that transfer model weights between CPU and GPU introduce latency, limiting inference performance. This paper presents a novel CPU-GPU collaborative inference framework that incorporates an expert caching mechanism on the GPU to reduce data transfer requirements and enable faster inference through cache hits. Computations are offloaded to CPU for efficient cache miss handling, which benefits from CPU multithreading optimizations. The evaluations of our framework demonstrate performance improvements and highlight the potential of CPU-GPU collaboration to maximize hardware utilization for single-request inference scenarios on consumer-grade systems. The implementation of our framework is available at https://github.com/elsa-lab/MoE-CPU-GPU-Collaborative-Inference.
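As a rough illustration of the "selective activation of parameter subsets" mentioned above, the following sketch shows generic top-k routing: a router scores every expert, and only the k highest-scoring experts are evaluated for a token. The function name `top_k_route`, the logits, and the choice of 8 experts with k=2 are assumptions made for this example (they mirror the commonly described Mixtral 8x7B setup) rather than details taken from the paper.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Rank expert indices by router score, highest first.
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    chosen = ranked[:k]
    # Softmax over the chosen experts only, shifted by the peak for stability.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Made-up router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
selected = top_k_route(logits, k=2)
print(selected)  # list of (expert index, mixing weight) pairs
```

Only the selected experts' weights are needed for the token, which is why keeping recently used experts on the GPU can avoid many transfers.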

Paper Structure

This paper contains 16 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of dense FFN and MoE layer architectures.
  • Figure 2: Mixtral 8x7B (32 Transformer blocks) expert selection patterns on the MMLU dataset (Hendrycks et al., 2021): (1) Consecutive Layers Pattern and (2) Consecutive Token Pattern. Each Transformer block's router demonstrates clear patterns of expert reuse.
  • Figure 3: Mixtral 8x7B token generation speed vs. CPU cores.
  • Figure 4: (a) Our workflow, and (b) timing comparison. Workflow includes cache checks, execution based on cache status ((a) on GPU if cache hit, (b) on CPU if cache miss), and asynchronous data transfer to update the GPU cache for future token generation. For layers beyond the cache coverage (e.g., Layer $Z$), all expert computations are done on the CPU.
  • Figure 5: Overall performance comparison of (a) Mixtral 8x7B and (b) Phi3.5-MoE models under different numbers of CPU cores (OMP_NUM_THREADS) and cache configurations. The cache legend indicates (# of indexes, # of ways).
  • ...and 1 more figures
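
The workflow in the Figure 4 caption and the (# of indexes, # of ways) cache configuration in the Figure 5 caption can be sketched together as follows. This is a simplified model under stated assumptions, not the authors' code: it assumes each cache index corresponds to one layer, each way holds one expert, layers beyond the indexed range always run on the CPU, and the asynchronous transfer is modelled as a queue that is applied between tokens. The names `SetAssociativeExpertCache`, `dispatch`, and `complete_transfers` are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeExpertCache:
    """Toy model: one cache set per covered layer, LRU eviction within a set."""

    def __init__(self, num_indexes, num_ways):
        self.num_indexes = num_indexes  # assumption: number of covered layers
        self.num_ways = num_ways        # assumption: experts cached per layer
        self._sets = [OrderedDict() for _ in range(num_indexes)]
        self.pending = []  # (layer, expert) copies queued for later transfer

    def dispatch(self, layer, expert):
        """Return where the expert runs for the current token."""
        if layer >= self.num_indexes:
            return "cpu"  # layer beyond cache coverage
        ways = self._sets[layer]
        if expert in ways:
            ways.move_to_end(expert)  # cache hit: refresh recency
            return "gpu"
        # Cache miss: compute on the CPU now, queue the weight copy.
        if (layer, expert) not in self.pending:
            self.pending.append((layer, expert))
        return "cpu"

    def complete_transfers(self):
        """Stand-in for waiting on the copy stream: install queued experts."""
        for layer, expert in self.pending:
            ways = self._sets[layer]
            if expert in ways:
                ways.move_to_end(expert)
                continue
            if len(ways) >= self.num_ways:
                ways.popitem(last=False)  # evict least recently used expert
            ways[expert] = None
        self.pending.clear()


def run_token(cache, routing):
    """Dispatch every (layer, expert) pair selected for one token."""
    return {(l, e): cache.dispatch(l, e) for l, es in routing.items() for e in es}


cache = SetAssociativeExpertCache(num_indexes=2, num_ways=2)
first = run_token(cache, {0: (1, 4), 1: (2, 4), 2: (0, 7)})   # cold cache
cache.complete_transfers()
second = run_token(cache, {0: (1, 5), 1: (2, 4), 2: (0, 7)})  # warm cache
print(first)
print(second)
```

In this toy run the first token finds an empty cache, while the second token hits on the experts it shares with the first in the two covered layers; layer 2 is outside the cache coverage and stays on the CPU throughout.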