Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems

Zehao Fan, Zhenyu Liu, Yunzhen Liu, Yayue Hou, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu

TL;DR

This work tackles memory-bound MoE inference by leveraging a GPU–NDP system with CXL-attached memory. It introduces a context-aware pipeline that (1) uses prefill-stage activation statistics to place hot experts on GPU and cold ones on NDP, (2) applies per-expert mixed-precision quantization on NDP controlled by a prefix-structured budget, and (3) overlaps GPU and NDP execution to minimize data movement. The approach yields large end-to-end speedups (up to ~11× decoding throughput) with negligible accuracy loss (0.13% on average at 3-bit), outperforming state-of-the-art baselines such as MoNDE and GPU-only Hobbit. These results demonstrate a practical, scalable path to running very large MoE models on memory-limited hardware by exploiting context, near-data processing, and selective precision augmentation.

Abstract

Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. In this case, weights must be offloaded to external memory, and fetching them incurs costly and repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems that are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP. To meet NDP's limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1 to 4 bits) based on prefill-stage activation statistics. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device movement. Evaluation on a GPU-NDP system shows that our approach achieves up to an 8.7-fold decoding throughput improvement over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.
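
The placement idea in the abstract (pin frequently activated experts in GPU memory, leave the rest on the NDP device) can be illustrated with a minimal sketch. It assumes a single MoE layer, a fixed number of GPU expert slots, and a plain top-k ranking by prefill activation count; the function name `place_experts`, the input format, and the ranking rule are hypothetical and are not taken from the paper, whose placement module may use a different criterion.

```python
from collections import Counter

def place_experts(prefill_routing, num_experts, gpu_slots):
    """Split one layer's experts into hot (GPU) and cold (NDP) sets.

    prefill_routing: per-token lists of expert ids selected by the router
    during prefill (hypothetical input format for this sketch).
    """
    counts = Counter()
    for token_experts in prefill_routing:
        counts.update(token_experts)
    # Rank by prefill activation count; break ties by expert id so the
    # result is deterministic.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    hot = sorted(ranked[:gpu_slots])    # pinned in GPU memory
    cold = sorted(ranked[gpu_slots:])   # executed in place on the NDP device
    return hot, cold

# Four prefill tokens with top-2 routing over 8 experts, 2 GPU slots.
routing = [[0, 3], [3, 5], [3, 0], [1, 3]]
hot, cold = place_experts(routing, num_experts=8, gpu_slots=2)
print(hot, cold)
```

Running the placement once per sequence, after prefill, matches the "once per sequence" description of the Expert Placement Module; a real system would repeat it for every MoE layer.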

Paper Structure

This paper contains 17 sections, 7 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: System overview. During MoE inference, prefill-stage expert activation statistics are collected and fed to two modules: the Expert Placement Module, which runs once per sequence to determine an efficient GPU/NDP expert mapping, and the Expert Bitwidth Selector, which uses the same statistics to assign per-expert quantization bitwidths on the NDP device, improving system performance while limiting accuracy loss.
  • Figure 2: Activation frequency of all experts in Mixtral-8x7B.
  • Figure 3: Different expert activation patterns for two samples with Mixtral-8x7B on the C4 dataset, indicating context dependence.
  • Figure 4: Expert activation similarities between prefill and decoding, motivating context-aware design.
  • Figure 5: End-to-end latency comparison across different methods, with NDP-side latency shown separately to highlight the benefits of our method in both reducing NDP computation and minimizing expert migration.
  • ...and 1 more figures
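
The Expert Bitwidth Selector in Figure 1 assigns per-expert bitwidths from prefill activation statistics. One simple way to picture this is a greedy allocation under an average-bit budget, sketched below. Everything here is an assumption made for illustration: the function name `assign_bitwidths`, the average-bit budget, and the greedy rule are hypothetical, and the paper's actual selector may work differently.

```python
def assign_bitwidths(activation_counts, avg_bits, min_bits=1, max_bits=4):
    """Greedy per-expert bitwidth assignment for NDP-resident experts.

    activation_counts: dict mapping expert id -> prefill activation count
    (hypothetical input). Every expert starts at min_bits; spare bits go to
    the most frequently activated experts first, capped at max_bits.
    """
    bits = {e: min_bits for e in activation_counts}
    # Total number of extra bits allowed by the average-bit budget.
    spare = int(round((avg_bits - min_bits) * len(bits)))
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    for e in ranked:
        if spare <= 0:
            break
        extra = min(max_bits - min_bits, spare)
        bits[e] += extra
        spare -= extra
    return bits

# Four cold experts with a 2.5-bit average budget.
counts = {1: 9, 2: 1, 4: 5, 5: 0}
bits = assign_bitwidths(counts, avg_bits=2.5)
print(bits)
```

In this sketch, the experts seen most often during prefill keep the most precision, while rarely used experts drop to the minimum bitwidth, so the average stays within the budget.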