Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

TL;DR

COMET targets the substantial inter-device communication bottleneck in Mixture-of-Experts (MoE) models by overlapping computation and communication at fine granularity. It introduces shared-tensor-based dependency resolving, which decomposes the shared data and reschedules the computation, and an adaptive workload assignment that balances resources inside fused kernels. On representative MoE models, the approach speeds up a single MoE layer by $1.96\times$ and end-to-end execution by $1.71\times$ on average. It has been validated on large GPU clusters and deployed in production, where it offers practical latency reductions and resource savings, and there are plans to open-source it to promote further optimization.

Abstract

Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. Developing large MoE models in a distributed setting, however, incurs large communication overhead: with popular models and frameworks, the inter-device communication of an MoE layer can occupy 47% of the entire model execution time. Existing methods therefore pipeline the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes notably impair computational efficiency, and their latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and delivers a $1.71\times$ end-to-end speedup on average. COMET has been adopted in production clusters at the scale of ten thousand GPUs, saving millions of GPU hours.
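
To make the overlap argument concrete, the following is a minimal back-of-the-envelope sketch, not COMET's cost model. It treats execution as one communication stage and one computation stage, splits both into equal chunks, and pipelines them. The function names, the equal-chunk assumption, and the `efficiency` penalty (a stand-in for the loss in computational efficiency that the abstract attributes to coarse-grained splitting) are all illustrative assumptions; only the 47% communication share comes from the text.

```python
def sequential_time(comm: float, comp: float) -> float:
    """No overlap: communication finishes before computation starts."""
    return comm + comp


def pipelined_time(comm: float, comp: float, chunks: int,
                   efficiency: float = 1.0) -> float:
    """Two-stage pipeline over `chunks` equal pieces (illustrative model).

    `efficiency` in (0, 1] is a hypothetical penalty: smaller kernels are
    assumed to run at a fraction of the unsplit kernel's throughput.
    """
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    comm_chunk = comm / chunks
    comp_chunk = comp / (chunks * efficiency)
    # The first chunk's communication and the last chunk's computation are
    # never hidden; in between, the slower stage sets the pace.
    return comm_chunk + (chunks - 1) * max(comm_chunk, comp_chunk) + comp_chunk


def hidden_comm_speedup_bound(comm_fraction: float) -> float:
    """Speedup if communication were hidden completely and nothing else changed."""
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    return 1.0 / (1.0 - comm_fraction)


if __name__ == "__main__":
    comm, comp = 47.0, 53.0  # 47% communication share, arbitrary time units
    print("sequential:", sequential_time(comm, comp))
    print("2 chunks, no penalty:", pipelined_time(comm, comp, chunks=2))
    print("2 chunks, 20% penalty:", pipelined_time(comm, comp, chunks=2, efficiency=0.8))
    print("bound:", round(hidden_comm_speedup_bound(0.47), 2))
```

Under these assumptions, fully hiding a 47% communication share would bound the speedup at roughly $1/(1-0.47)\approx 1.89\times$, and any per-chunk efficiency loss eats directly into what a two-chunk pipeline recovers.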

Paper Structure

This paper contains 24 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Analysis of the execution of MoE. (a) Time breakdown of MoE models executed on 8 H800 GPUs using Megatron-LM. (b) An illustration of communication-computation overlap by partitioning an expert computation kernel into two.
  • Figure 2: Example of an MoE layer across two GPUs, with two experts residing on GPU0 and two on GPU1. The MoE layer is composed of two feed-forward layers. In this example, each token in the input buffer is dispatched to three experts ($topk=3$) in layer0, and the results are then combined in layer1. The shape of experts is $N\times K$ in layer0 and $K\times N$ in layer1.
  • Figure 3: Design overview of Comet. Comet is composed of a shared tensor-based dependency resolving method and an adaptive workload assignment mechanism.
  • Figure 4: The producer-consumer modeling of layer0 (left) and layer1 (right) of an MoE layer. The global size of the shared tensor is $(M\times topk, N)$ for both layer0 and layer1.
  • Figure 5: Decompose and reschedule the shared tensor in MoE layer0. In this illustration, three experts are located on Rank 0, each requiring both local and remote data for computation.
  • ...and 9 more figures
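
The data flow described in the captions of Figures 2 and 4 can be sketched on a single device with NumPy. This is a reference for the shapes only, under stated assumptions: the function name, the ReLU activation, and the softmax gate over the selected experts are placeholders that the captions do not specify, and no communication or overlapping is modeled. What it does take from the captions is that tokens of shape $(M, N)$ are dispatched to $topk$ experts, experts are $N\times K$ in layer0 and $K\times N$ in layer1, and the dispatched and pre-combine tensors both have global size $(M\times topk, N)$.

```python
import numpy as np


def moe_layer_reference(x, w0, w1, gate_logits, topk):
    """Single-device MoE reference (illustrative; not COMET's implementation).

    x:           (M, N) input tokens
    w0:          (E, N, K) layer0 expert weights
    w1:          (E, K, N) layer1 expert weights
    gate_logits: (M, E) router scores (how they are produced is out of scope)
    """
    M, N = x.shape
    E = w0.shape[0]

    # Pick the topk experts per token and normalize their scores (assumed gate).
    expert_ids = np.argsort(-gate_logits, axis=1)[:, :topk]          # (M, topk)
    scores = np.take_along_axis(gate_logits, expert_ids, axis=1)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    # Dispatch: every token appears topk times, giving an (M*topk, N) tensor.
    token_ids = np.repeat(np.arange(M), topk)
    flat_experts = expert_ids.reshape(-1)
    dispatched = x[token_ids]

    # Expert computation: layer0 (N x K), activation, layer1 (K x N).
    layer1_out = np.zeros_like(dispatched)                            # (M*topk, N)
    for e in range(E):
        rows = np.nonzero(flat_experts == e)[0]
        if rows.size == 0:
            continue
        hidden = np.maximum(dispatched[rows] @ w0[e], 0.0)            # ReLU placeholder
        layer1_out[rows] = hidden @ w1[e]

    # Combine: weighted reduction over the topk copies of each token.
    combined = (layer1_out.reshape(M, topk, N) * weights[:, :, None]).sum(axis=1)
    return dispatched, layer1_out, combined


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, N, K, E, topk = 4, 3, 5, 4, 3
    d, y, out = moe_layer_reference(
        rng.normal(size=(M, N)), rng.normal(size=(E, N, K)),
        rng.normal(size=(E, K, N)), rng.normal(size=(M, E)), topk)
    print(d.shape, y.shape, out.shape)
```

In a distributed run, `dispatched` and `layer1_out` correspond to the tensors that sit between communication and computation, which is why their layout matters for how the work can be decomposed.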