EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai

TL;DR

This work tackles throughput bottlenecks in large MoE inference by introducing EPS-MoE, a three-part framework that optimizes kernel choice (GroupGemm vs DenseGemm), partitions work with an expert-aware pipeline scheduler, and overlaps computation with communication. By applying a principled Parallel Strategy (TP/DP for Attention, EP for MoE) and a load-aware Expert Pipeline Scheduler, the approach achieves substantial prefill throughput gains (up to 52.4% overall, with real-world cases reaching 120K tokens/s on DeepSeekV2). The paper provides detailed hardware/algorithmic analysis, kernel-switching criteria, and extensive ablations across multiple models (DeepSeekV2, Mixtral8x7B, DBRX, Snowflake Arctic), demonstrating the practicality and scalability of the solution. The results suggest that dynamic, load-aware GEMM selection and kernel-overlap-driven scheduling can meaningfully improve MoE inference in multi-GPU deployments, enabling cheaper and faster LLM inference at scale.

Abstract

The Mixture-of-Experts (MoE) model has emerged as a prominent architecture in the field of Large Language Models (LLMs), providing a better balance between model performance and computational efficiency. However, the General Matrix Multiply (GEMM) operations and large parameter counts introduce challenges related to computational efficiency and communication overhead, which become throughput bottlenecks during inference. Applying a single parallelism strategy such as EP, DP, or TP, or a straightforward combination of them, to MoE usually achieves sub-optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our approach optimizes the computation of MoE FeedForward Network (FFN) modules by dynamically selecting the best kernel implementation, GroupGemm or DenseGemm, for different loads and adaptively overlapping these computations with communication, leading to a substantial increase in throughput. Our experimental results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods. Specifically, our method accelerated the highly optimized DeepSeekV2 model from a claimed 100K tokens per second to at least 120K tokens per second.
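The load-dependent choice between GroupGemm and DenseGemm described in the abstract can be pictured with a small sketch. Everything in it is an illustrative assumption rather than the paper's method: the function names `tokens_per_expert` and `choose_kernel`, the threshold `DENSE_MIN_TOKENS`, and the rule itself (prefer dense GEMMs only when every active expert has a large batch) are hypothetical stand-ins for the profiling-based switching criteria the paper derives.

```python
from collections import Counter

# Hypothetical cut-over point, chosen only for illustration.
# The paper derives its own switching criteria from GEMM profiling.
DENSE_MIN_TOKENS = 256


def tokens_per_expert(expert_ids, num_experts):
    """Count how many routed tokens each expert receives."""
    counts = Counter(expert_ids)
    return [counts.get(e, 0) for e in range(num_experts)]


def choose_kernel(counts, dense_min_tokens=DENSE_MIN_TOKENS):
    """Pick a GEMM flavor for one MoE FFN call (assumed rule, not the paper's).

    Returns "dense" when every active expert has at least
    `dense_min_tokens` tokens, "group" when some active expert has a
    smaller batch, and "none" when no expert received any token.
    """
    active = [c for c in counts if c > 0]
    if not active:
        return "none"
    return "dense" if min(active) >= dense_min_tokens else "group"


if __name__ == "__main__":
    routed = [0, 1, 0, 3]  # expert index chosen for each token
    counts = tokens_per_expert(routed, num_experts=4)
    print(counts, choose_kernel(counts))
```

In a real system the decision would presumably also depend on matrix shapes, the number of groups, and the SMs available to the kernel, which are the variables that Figure 5 profiles.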

Paper Structure

This paper contains 26 sections, 25 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Weight partition of DP, TP and EP for two devices and two experts.
  • Figure 2: MoE architecture. Adopted from nanoflow. The operations in the yellow boxes are compute-bound, mostly GEMMs. The operations in the light blue boxes are memory-bound. The operations in the green boxes are communication operations.
  • Figure 3: Resource view of the NVIDIA GPU architecture.
  • Figure 4: GroupGemm demonstration. Adopted from who_says_elephants. All matrix multiplication operations are performed through a single kernel launch.
  • Figure 5: GEMM profiling data. (a) Throughput of different GEMMs was tested with varying input sizes, focusing on Gate and Up matrices from MoE blocks. For GroupGemm, 16 Experts were used with matrix dimensions $[1536, 5120]$, ensuring equal total computation with DenseGemm. (b) GroupGemm [Pingpong] was tested for relative throughput across different groups and problem sizes, comparing current group throughput to the maximum. (c) GroupGemm [Pingpong] was tested with input size $m = 6144$, evaluating throughput across different SM counts and groups.
  • ...and 6 more figures