EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai
TL;DR
This work tackles throughput bottlenecks in large MoE inference by introducing EPS-MoE, a three-part framework that selects the GEMM kernel (GroupGemm vs. DenseGemm) according to load, partitions work with an expert-aware pipeline scheduler, and overlaps computation with communication. By combining a fixed parallel strategy (TP/DP for Attention, EP for MoE) with a load-aware Expert Pipeline Scheduler, the approach improves prefill throughput by up to 52.4% over existing parallel inference methods and raises DeepSeekV2 from a claimed 100K tokens/s to at least 120K tokens/s. The paper provides hardware and algorithmic analysis, kernel-switching criteria, and ablations across multiple models (DeepSeekV2, Mixtral8x7B, DBRX, Snowflake Arctic). The results suggest that dynamic, load-aware GEMM selection combined with scheduling that overlaps computation and communication can meaningfully improve MoE inference in multi-GPU deployments, lowering the cost of LLM inference at scale.
Abstract
The Mixture-of-Experts (MoE) model has emerged as a prominent architecture in the field of Large Language Models (LLMs), offering a better balance between model performance and computational efficiency. However, its General Matrix Multiply (GEMM) operations and large parameter count introduce challenges in computational efficiency and communication overhead, which become throughput bottlenecks during inference. Applying a single parallelism strategy such as expert parallelism (EP), data parallelism (DP), or tensor parallelism (TP), or a straightforward combination of them, to MoE usually yields sub-optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our approach optimizes the computation of MoE FeedForward Network (FFN) modules by dynamically selecting the better kernel implementation, GroupGemm or DenseGemm, for each load and adaptively overlapping these computations with communication, leading to a substantial increase in throughput. Our experimental results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods. Specifically, our method accelerates the highly optimized DeepSeekV2 model from a claimed 100K tokens per second to at least 120K tokens per second.
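
To make the idea of load-dependent kernel selection concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a dense GEMM becomes preferable once the average number of tokens routed to each active expert is large, and that a grouped GEMM is preferable for many small per-expert batches. The function names, the `DENSE_THRESHOLD` value, and the averaging rule are hypothetical and chosen only for illustration; a real switching point would come from profiling on the target hardware.

```python
from collections import Counter

# Hypothetical switching point (tokens per active expert). This value is an
# assumption for illustration, not a number taken from the paper.
DENSE_THRESHOLD = 256


def tokens_per_expert(expert_ids, num_experts):
    """Count how many tokens the router assigned to each expert."""
    counts = Counter(expert_ids)
    return [counts.get(e, 0) for e in range(num_experts)]


def select_kernel(loads, threshold=DENSE_THRESHOLD):
    """Return a kernel name based on the mean load of the active experts.

    Assumed rule: large per-expert batches favor a dense GEMM, while many
    small batches favor a grouped GEMM. Returns "skip" if no expert has work.
    """
    active = [n for n in loads if n > 0]
    if not active:
        return "skip"
    mean_load = sum(active) / len(active)
    return "DenseGemm" if mean_load >= threshold else "GroupGemm"


if __name__ == "__main__":
    # Four tokens routed to experts 0, 0, 1 and 3 out of 4 experts.
    loads = tokens_per_expert([0, 0, 1, 3], num_experts=4)
    print(loads, select_kernel(loads))
```

In this sketch the decision is made once per MoE layer invocation from the routing counts, so it could be re-evaluated for every batch as the load changes.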
