Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping

Chenyu Jiang; Ye Tian; Zhen Jia; Shuai Zheng; Chuan Wu; Yida Wang

TL;DR

Lancet targets the MoE training bottleneck caused by long all-to-all communication by extending computation-communication overlap to the entire training graph. It takes a compiler-based approach with two optimization passes: a Weight Gradient Computation Schedule Pass, which overlaps weight gradient computation in the backward pass with all-to-all, and an Operator Partition Pass, which partitions and pipelines non-MoE and MoE computations in the forward pass using dynamic programming and constraint-satisfaction-based axis inference. Key contributions include widening the region considered for overlap beyond expert computation, a greedy weight-gradient scheduling algorithm, a partitioning scheme that accounts for irregular all-to-all, and a cost-model-driven optimization pipeline. Evaluations show up to a 77% reduction in non-overlapped communication time and end-to-end speedups of up to 1.3x over baselines such as DeepSpeed and Tutel. Lancet improves MoE training throughput on large-scale hardware and is designed to be compatible with other MoE optimization techniques.

Abstract

The Mixture-of-Experts (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it faces the challenge of extended all-to-all communication latency during the training process. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. Yet, these methods frequently fall short of achieving sufficient overlap, consequently restricting the potential for performance enhancements. In our study, we extend the scope of this challenge by considering overlap at the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling weight gradient computations. We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3 times when compared to the state-of-the-art solutions.
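The backward-pass idea in the abstract can be illustrated with a toy scheduler. The sketch below is not Lancet's algorithm; it is a simplified greedy packing written for illustration only. The function name `greedy_overlap`, the operator names, the millisecond costs, and the `ready` index (the first all-to-all an operator may legally overlap with) are all hypothetical assumptions. It shows the general shape of the problem: weight gradient computations do not feed later backward operators, so they can be moved into the time windows occupied by all-to-all.

```python
def greedy_overlap(a2a_ms, wgrad_ops):
    """Toy greedy packing of weight-gradient ops into all-to-all windows.

    a2a_ms:    all-to-all durations (ms) in backward execution order.
    wgrad_ops: list of (name, cost_ms, ready), where `ready` is the index of
               the first all-to-all that starts after the op's inputs exist
               (hypothetical encoding of the data dependency).
    Returns (assignment, remaining): ops placed in each window, and the
    communication time in each window left without overlapping compute.
    """
    remaining = list(a2a_ms)
    assignment = {i: [] for i in range(len(a2a_ms))}
    # Place the most expensive ops first so they get the widest windows.
    for name, cost, ready in sorted(wgrad_ops, key=lambda op: -op[1]):
        candidates = [i for i in range(ready, len(remaining))
                      if remaining[i] >= cost]
        if not candidates:
            continue  # op keeps its original position in the schedule
        best = max(candidates, key=lambda i: remaining[i])
        assignment[best].append(name)
        remaining[best] -= cost
    return assignment, remaining


# Hypothetical costs: two all-to-alls and three weight-gradient ops.
assignment, remaining = greedy_overlap(
    [4.0, 3.0],
    [("dW_attn", 2.5, 0), ("dW_ffn", 2.0, 1), ("dW_proj", 1.5, 0)],
)
print(assignment)  # {0: ['dW_attn', 'dW_proj'], 1: ['dW_ffn']}
print(remaining)   # [0.0, 1.0]
```

In this made-up example the first all-to-all is fully covered by computation and the second keeps 1 ms of non-overlapped communication. A real scheduler would rely on profiled or modeled costs rather than fixed numbers.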

Paper Structure (36 sections, 4 equations, 16 figures, 1 algorithm)

Introduction
Background and Motivation
Mixture of Experts (MoE)
Expert-parallelism
Routing algorithms
Overlapping all-to-all and experts
Opportunities and Challenges
Opportunity 1: Weight gradient computation.
Opportunity 2: Non-MoE computation.
Challenge 1: How to perform mathematically equivalent partition.
Challenge 2: How to determine the optimal partition range for non-MoE operators.
Lancet Overview
Weight Gradient Computation Schedule Pass
Weight Gradient Computation Labelling
Weight Gradient Computation Scheduling
...and 21 more sections

Figures (16)

Figure 1: An example MoE layer with 4 experts scattered on 2 devices. Assume top-1 gating is used. Blue (green) boxes represent computation (communication) operators. Data dependencies between operators are highlighted by red arrows. The Gate assigns each input token to an expert. All-to-alls fetch expert input/output from other devices. Gather restores the received tokens to their original order, matching the input sequence.
Figure 2: Breakdown of execution time when running a GPT-2 model with MoE layers using Tutel and DeepSpeed on Amazon EC2 p3dn instances. Orig.: unoptimized execution time. Curr.: performance upper-bound when optimized using current overlapping methods (expert computation completely hidden by all-to-all). Opt.: ideal execution time (all-to-all fully overlapped by computation).
Figure 3: Scheduling weight gradient computation to overlap with all-to-all.
Figure 4: Performance gain of different overlapping types.
Figure 5: Operator partitioning scheme in an MoE layer. The number in each token shows its assigned expert. Tokens of the same color belong to the same sequence.
...and 11 more figures
