MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints

Yichao Yuan, Lin Ma, Nishil Talati

TL;DR

MoE-Lens tackles the memory bottleneck in resource-constrained MoE LLM serving with a holistic two-stage performance model (Stage 1: a theoretical upper bound that accounts for CPU memory capacity via PME; Stage 2: realistic scheduling with finite batch sizes and a paged KV cache) and an architecture-aware system designed to approach hardware limits. The system combines a resource-aware scheduler, a versatile pipeline (VSLPipe) that co-processes prefill and decode, a Contiguous Data Mover for IO, and a hand-optimized CPU decode attention kernel. Empirically, MoE-Lens improves throughput over the state-of-the-art MoE-Lightning by 4.6x on average (up to 25.5x), and its performance model predicts throughput with about 94% average accuracy, supporting the case for holistic modeling and hardware-aware design in high-throughput MoE inference. The work targets large MoE LLMs deployed on CPU–GPU hybrid systems, where it enables batch-oriented inference under tight GPU memory constraints.

Abstract

Mixture of Experts (MoE) LLMs, characterized by their sparse activation patterns, offer a promising approach to scaling language models without a proportional increase in inference cost. However, their large parameter sizes present deployment challenges in resource-constrained environments, where GPU memory is often insufficient to hold the full set of model weights. Consequently, typical deployments rely on CPU-GPU hybrid execution: the GPU handles compute-intensive GEMM operations, while the CPU processes the relatively lightweight attention mechanism. This setup introduces a key challenge: how to optimize resource utilization across CPU and GPU effectively? Prior work has designed system optimizations based on performance models with limited scope. Specifically, such models do not capture the complex interactions between hardware properties and system execution mechanisms. As a result, previous approaches neither identify nor achieve the hardware limit. This paper presents MoE-Lens, a high-throughput MoE LLM inference system designed through holistic performance modeling for resource-constrained environments. Our performance model analyzes the fundamental system components, including CPU memory capacity, GPU compute power, and workload characteristics, to establish the theoretical performance upper bound of MoE inference. Furthermore, it captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput. Informed by our performance model, MoE-Lens introduces an inference system that approaches hardware limits. Evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x), and our theoretical model predicts performance with 94% accuracy on average.
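
To make the role of CPU memory capacity concrete, the following is a minimal back-of-the-envelope sketch, not the paper's performance model. It estimates how many sequences can keep their KV cache resident in CPU memory at once. The function names are hypothetical. The Mixtral 8x7B shape (32 layers, 8 KV heads, head dimension 128, fp16) comes from the public model configuration; the 100 GB cache size and the values p = 100 and g = 128 are borrowed from the Figure 3 caption, reading p as prompt length, g as generation length, and GB as 10^9 bytes, all of which are assumptions here.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_resident_sequences(kv_cache_bytes, prompt_len, gen_len, per_token_bytes):
    """Sequences that fit if each one is budgeted its peak footprint.

    Assumption: a sequence eventually holds prompt_len + gen_len tokens,
    and we reserve that worst case for every sequence up front.
    """
    peak_tokens = prompt_len + gen_len
    return kv_cache_bytes // (peak_tokens * per_token_bytes)


# Mixtral 8x7B shape in fp16 (assumed from the public model config).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# 100 GB KV cache, p = 100, g = 128 (values taken from the Figure 3 caption).
n_seqs = max_resident_sequences(100 * 10**9, prompt_len=100, gen_len=128,
                                per_token_bytes=per_token)

print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(n_seqs)     # 3346 sequences
```

Under these assumptions the KV cache caps the number of in-flight sequences at a few thousand, which illustrates why CPU memory capacity can bound the batch size, and hence the GPU utilization, that a hybrid system can reach.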

Paper Structure

This paper contains 25 sections, 14 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Sample of an execution timeline of GPU computation and CPU-GPU IO during the prefill and decode stages of MoE-Lightning.
  • Figure 2: Overview of MoE-Lens that combines a theoretical performance upper bound, resource and workload aware performance model, and an informed system design to reach hardware limits.
  • Figure 3: Visualization of the maximum GPU utilization $\frac{T_{max}}{T_{GPU}}$. (a) Maximum GPU utilization when running Mixtral 8x7B on A40 with 100 GB KV cache. (b) For the same model and GPU, the maximum GPU utilization when $p = 100$ and $g = 128$.
  • Figure 4: Predicted GPU utilization under different request batch sizes, with $p=100$ and $g=128$.
  • Figure 5: System overview of MoE-Lens.
  • ...and 8 more figures