TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading

Yudong Pan; Yintao He; Tianhua Han; Lian Liu; Shixin Zhao; Zhirong Chen; Mengdi Wang; Cangyuan Li; Yinhe Han; Ying Wang

TL;DR

TriMoE is a GPU-CPU-NDP architecture for offloading-based MoE inference. It uses an AMX-enabled CPU to fill the compute gap between the GPU and DIMM-NDP, mapping hot, warm, and cold experts onto their optimal compute units.

Abstract

To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, while emerging GPU-NDP architectures use DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by leveraging an AMX-enabled CPU to map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
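
To make the hot/warm/cold mapping idea concrete, the following is a minimal sketch of a generic greedy makespan heuristic that places experts on three devices. It is not the paper's scheduling policy: the `Device` cost model (a fixed per-expert overhead plus a per-token cost), the function names, and all numbers are hypothetical and chosen only so that a high fixed cost stands in for GPU I/O latency and a high per-token cost stands in for limited NDP compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical cost model for one compute unit (illustrative only)."""
    name: str
    fixed_cost: float      # per-expert overhead, e.g. moving weights to the GPU
    per_token_cost: float  # marginal compute cost per routed token


def expert_time(dev: Device, tokens: int) -> float:
    """Estimated time for one expert on one device under the toy model."""
    return dev.fixed_cost + dev.per_token_cost * tokens


def greedy_schedule(token_counts: dict, devices: list):
    """Greedy makespan heuristic: visit experts from most to fewest tokens and
    place each on the device whose finish time after the placement is smallest.
    Returns (assignment, makespan)."""
    load = {d.name: 0.0 for d in devices}
    assignment = {}
    for expert, tokens in sorted(token_counts.items(), key=lambda kv: -kv[1]):
        if tokens == 0:
            continue  # inactive experts need no compute this step
        best = min(devices, key=lambda d: load[d.name] + expert_time(d, tokens))
        load[best.name] += expert_time(best, tokens)
        assignment[expert] = best.name
    return assignment, max(load.values())


# Assumed, made-up parameters: GPU has high fixed cost but cheap tokens,
# NDP has no fixed cost but expensive tokens, CPU sits in between.
DEVICES = [
    Device("GPU", fixed_cost=10.0, per_token_cost=0.01),
    Device("CPU", fixed_cost=2.0, per_token_cost=0.1),
    Device("NDP", fixed_cost=0.0, per_token_cost=1.0),
]

if __name__ == "__main__":
    counts = {"hot": 512, "warm": 32, "cold": 2}
    print(greedy_schedule(counts, DEVICES))
```

Under these assumed costs, an expert with many tokens amortizes the GPU's fixed overhead, an expert with very few tokens is cheapest on the NDP, and the middle range lands on the CPU, which is the qualitative behaviour the abstract describes.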

Paper Structure (25 sections, 7 equations, 9 figures, 3 tables)

Introduction
Background
MoE Architecture & Inference
Offloading in MoE Inference
Motivation
The Scheduling Dilemma of Warm Experts
Opportunity: AMX-CPU Bridging the GPU–NDP Compute Gap
TriMoE
TriMoE Architecture
Bottleneck-Aware Greedy Makespan Expert Scheduling
Prediction-Driven Expert Relayout and Rebalancing
Evaluation
Experiment Setup
TriMoE System
Baselines
...and 10 more sections

Figures (9)

Figure 1: Execution timelines of baseline MoE offloading systems and our proposed architecture.
Figure 2: MoE architecture with shared and routed experts.
Figure 3: Expert Activation across Batch Sizes, Models and Datasets. (a) Fine-grained activation for a specific configuration. (b) Summary of activation across all configurations.
Figure 4: Overview of TriMoE. (a) DIMM-NDP architecture. (b) Online expert scheduling & Offline expert relayout-rebalancing.
Figure 5: Compute Characterization. (a) Measured Throughput vs. Token Count. (b) Empirical GPU-CPU-NDP roofline.
...and 4 more figures
