MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai

TL;DR

MoE-CAP tackles the challenge of benchmarking sparse Mixture-of-Experts systems by formalizing a CAP (Cost–Accuracy–Performance) framework that accounts for heterogeneous hardware and sparse activation patterns. It introduces a complete cost model, a CAP radar visualization, and sparsity-aware metrics—S-MBU and S-MFU—to accurately quantify resource usage and guide deployment decisions. The approach demonstrates that existing benchmarks overestimate resource costs due to ignoring routing and activation sparsity, and provides an automated, cross-framework evaluation pipeline supported by real-world datasets. Overall, MoE-CAP enables principled, hardware-aware comparisons of MoE designs and promotes co-design of models and systems for practical, cost-effective deployment.

Abstract

The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.

Paper Structure

This paper contains 28 sections, 8 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Overview of MoE-CAP. Left: We identify trade-offs between hardware Cost, model Accuracy, and application Performance. Right: MoE-CAP introduces new sparsity-aware metrics and CAP radar diagrams to accurately and comprehensively evaluate MoE systems, helping users choose both the right MoE system and suitable hardware.
  • Figure 2: MoE memory access and performance metrics under three routing scenarios. $S$ denotes the size of a single expert. Existing MBU/MFU definitions overestimate costs by ignoring routing and expert selection. We show the extent of this overestimation relative to actual values.
  • Figure 3: CAP Radar diagrams comparing representative PA, PC, and CA systems. Left: Trade-offs among SGLang (PA), K-Transformers (PC), and MoE-Infinity (CA) on Qwen3-30B-A3B. Right: Trade-offs between offloading (MoE-Infinity) and quantization (SGLang-FP8/AWQ).
  • Figure 4: Benchmarking MoE deployment using sparsity-aware performance metrics. Horizontal lines show the minimum bandwidth required for MoE models to meet a decoding latency target, under two scenarios: full activation (large batch size) and minimal activation (batch size = 1). Blue dots represent each device’s peak bandwidth and TDP; orange dots indicate reduced bandwidth when DRAM offloading is needed. Devices above the lines satisfy the latency requirement. Systems are grouped by deployment class: edge (e.g., robotics, autonomous driving), low-power devices, workstations, and data centers.
  • Figure 5: Illustration of how model sparsity varies with batch size, along with the corresponding deployment scenarios on DeepSeek-V2-Lite, Qwen1.5-MoE, and DeepSeek-R1.
  • ...and 4 more figures
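
The bandwidth thresholds described in the Figure 4 caption can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not the paper's cost model: it assumes decoding is memory-bound and that each decoding step reads the shared (non-expert) weights plus the activated experts exactly once. The function name, its parameters, and all numbers are hypothetical and chosen only for illustration.

```python
def min_bandwidth_gb_per_s(shared_bytes, expert_bytes, num_layers,
                           experts_read_per_layer, latency_target_s):
    """Estimate the memory bandwidth (GB/s) needed to finish one decoding
    step within latency_target_s.

    Assumption (not from the paper): the step is memory-bound and reads the
    shared weights plus every activated expert once.
    """
    bytes_per_step = (shared_bytes
                      + num_layers * experts_read_per_layer * expert_bytes)
    return bytes_per_step / latency_target_s / 1e9


# Hypothetical model: 24 MoE layers, 64 experts per layer, top-6 routing,
# 50 MB per expert, 2 GB of shared weights, 50 ms per-token latency target.
model = dict(shared_bytes=2_000_000_000, expert_bytes=50_000_000,
             num_layers=24, latency_target_s=0.05)

# Minimal activation (batch size = 1): only the top-6 experts per layer are read.
minimal = min_bandwidth_gb_per_s(experts_read_per_layer=6, **model)

# Full activation (large batch): routing touches all 64 experts per layer.
full = min_bandwidth_gb_per_s(experts_read_per_layer=64, **model)

print(f"minimal activation: {minimal:.0f} GB/s")
print(f"full activation:    {full:.0f} GB/s")
```

Under these assumed numbers the two scenarios differ by almost an order of magnitude in required bandwidth, which is why a metric that always counts every expert can overstate the hardware needed for small-batch deployment.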