MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu

TL;DR

MegaScale-Infer tackles the inefficiency of serving large MoE-based LLMs by disaggregating attention and FFN modules onto separate GPUs, enabling independent scaling and heterogeneous deployment. It introduces ping-pong pipeline parallelism to overlap computation with communication, and a specialized M2N communication library that reduces communication overhead compared with NCCL. Evaluated on multi-node clusters with models from ~132B to ~317B parameters, it achieves up to 1.90x higher per-GPU throughput and up to 1.66x higher end-to-end throughput per unit cost in heterogeneous deployments, and it reduces serving costs in production. These results point to a practical path to cost-effective MoE serving at scale.

Abstract

Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
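To make the ping-pong idea concrete, the following is a minimal timing sketch, not the paper's implementation. It assumes a single attention node and a single expert (FFN) node, fixed per-micro-batch compute times, and communication modeled as pure latency with no link contention. The function name `simulate_ping_pong` and all parameters are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple


class PipelineStats(NamedTuple):
    makespan: float   # time until the last micro-batch returns from the last layer
    attn_util: float  # fraction of the makespan the attention node is busy
    ffn_util: float   # fraction of the makespan the expert node is busy


def simulate_ping_pong(num_layers: int, num_micro_batches: int,
                       t_attn: float, t_ffn: float, t_comm: float) -> PipelineStats:
    """Toy model: each micro-batch alternates attention -> send -> FFN -> send back.

    Assumptions (illustrative only): one attention node, one expert node,
    each serving micro-batches in a fixed round-robin order, and a constant
    one-way communication latency t_comm.
    """
    attn_free = 0.0                      # when the attention node is next idle
    ffn_free = 0.0                       # when the expert node is next idle
    ready = [0.0] * num_micro_batches    # when each micro-batch is back at attention

    for _ in range(num_layers):
        for mb in range(num_micro_batches):
            # Attention starts once the node is free and the micro-batch has returned.
            attn_end = max(attn_free, ready[mb]) + t_attn
            attn_free = attn_end
            # FFN starts once the activations have arrived and the expert node is free.
            ffn_end = max(ffn_free, attn_end + t_comm) + t_ffn
            ffn_free = ffn_end
            # The result travels back to the attention node for the next layer.
            ready[mb] = ffn_end + t_comm

    makespan = max(ready)
    work_items = num_layers * num_micro_batches
    return PipelineStats(makespan,
                         work_items * t_attn / makespan,
                         work_items * t_ffn / makespan)


if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        print(m, simulate_ping_pong(num_layers=32, num_micro_batches=m,
                                    t_attn=1.0, t_ffn=1.0, t_comm=0.5))
```

In this toy model with `t_attn == t_ffn == t`, a micro-batch needs `2 * (t + t_comm)` to leave attention and come back, so the attention node stays busy only when `num_micro_batches * t` covers that round trip. With a single micro-batch each node idles while the other computes, which is the utilization gap that shuttling several micro-batches is meant to close.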

Paper Structure

This paper contains 21 sections, 5 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: GPU utilization of Attention and FFN vs. batch size in dense model, MoE, and MegaScale-Infer during decoding.
  • Figure 2: MoE and expert parallelism.
  • Figure 3: MegaScale-Infer runtime instance architecture.
  • Figure 4: Illustration of ping-pong pipeline parallelism.
  • Figure 5: One-to-N latency: a single sender sends 128K bytes to each of the N receivers, where |N| ∈ {8, 16, 32}.
  • ...and 11 more figures