Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation

Fahao Chen, Peng Li, Zicong Hong, Zhou Su, Song Guo

TL;DR

Luffy is a communication-efficient distributed MoE training system built on two new techniques: sequence migration, which moves sequences among GPUs so that heavy token-pulling paths stay within a GPU and experts need not be copied across GPUs, and token condensation, which identifies similar tokens and eliminates their redundant transmissions.

Abstract

Mixture-of-Experts (MoE) is an emerging technique for scaling large models with sparse activation. MoE models are typically trained in a distributed manner with an expert parallelism scheme, where the experts in each MoE layer are distributed across multiple GPUs. However, default expert parallelism suffers from a heavy network burden because of the all-to-all exchange of intermediate data among GPUs before and after expert execution. Some existing works reduce this exchange by transferring experts between GPUs, but doing so lowers the parallelism of expert execution and makes computation inefficient. These weaknesses motivate us to explore whether it is possible to reduce inter-GPU traffic while maintaining a high degree of expert parallelism. This paper answers in the affirmative by presenting Luffy, a communication-efficient distributed MoE training system with two new techniques. First, Luffy migrates sequences among GPUs to hide heavy token-pulling paths within GPUs and to avoid copying experts across GPUs. Second, we propose token condensation, which identifies similar tokens and then eliminates redundant transmissions. We implement Luffy on top of PyTorch and evaluate its performance on a testbed of 16 V100 GPUs. Luffy achieves a speedup of up to 2.73x compared to state-of-the-art MoE training systems.
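
The idea behind token condensation can be illustrated with a minimal sketch. The abstract does not state how similarity is measured or how representatives are chosen, so everything below is an assumption made for illustration: cosine similarity, a fixed threshold of 0.95, a greedy first-match grouping, and the hypothetical helper names `cosine`, `condense`, and `restore`. It is not Luffy's implementation; it only shows how sending one representative per group of similar tokens reduces the number of tokens transmitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def condense(tokens, threshold=0.95):
    """Greedily group similar tokens (assumed scheme, not taken from the paper).

    Returns:
        reps: indices of the tokens that would actually be transmitted.
        mapping: mapping[i] is the position in `reps` of token i's representative.
    """
    reps = []
    mapping = []
    for i, tok in enumerate(tokens):
        for pos, rep_idx in enumerate(reps):
            if cosine(tok, tokens[rep_idx]) >= threshold:
                mapping.append(pos)  # reuse an existing representative
                break
        else:
            reps.append(i)  # no similar representative found: send this token
            mapping.append(len(reps) - 1)
    return reps, mapping


def restore(rep_outputs, mapping):
    """Give every original token the expert output of its representative."""
    return [rep_outputs[pos] for pos in mapping]


# Four toy tokens: the first two are nearly parallel, the last two are parallel.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 2.0]]
reps, mapping = condense(tokens, threshold=0.95)
print(reps, mapping)
```

In this toy input only two of the four tokens would be dispatched; `restore` then copies each representative's expert output back to the tokens it stands in for.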

Paper Structure

This paper contains 21 sections, 2 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison between existing works and our ideas. We assume that each GPU holds one expert (e.g., GPU $k$ holds Expert $k$). A: attention computation; E: expert computation; D: token dispatch; C: token combine; T: expert transfer. Each rectangle represents a part of a sequence, and the width of a rectangle indicates its length. Arrows with solid and dashed lines represent intra-GPU and inter-GPU traffic, respectively, and the arrow width indicates the amount of traffic. The red hatched rectangles in (d) represent the tokens eliminated by token condensation. The block index is denoted by $b$.
  • Figure 2: An illustration of MoE.
  • Figure 3: Biased expert activation for sequences under different models after 30 training iterations, where one training iteration is the training on one batch of data. Different colors represent hotness values, which indicate the portions of tokens routed to different experts.
  • Figure 4: Batch time on one GPU with different numbers of experts. The batch size is set to 1.
  • Figure 5: Token similarity and its change after expert execution, measured after 30 training iterations. All results are shown for block 1 (left), block 3 (middle), and block 6 (right).
  • ...and 5 more figures
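
The hotness value in the Figure 3 caption is described as the portion of tokens routed to each expert. A minimal sketch of that ratio follows, assuming top-1 routing (each token goes to exactly one expert) and using the hypothetical function name `expert_hotness`; the routing list is made-up toy data, not a result from the paper.

```python
from collections import Counter


def expert_hotness(assignments, num_experts):
    """Fraction of tokens routed to each expert.

    assignments: one expert index per token (top-1 routing is an assumption).
    num_experts: total number of experts in the MoE layer.
    """
    total = len(assignments)
    if total == 0:
        return [0.0] * num_experts
    counts = Counter(assignments)
    return [counts[e] / total for e in range(num_experts)]


# Toy routing decisions for one sequence of 8 tokens over 4 experts.
routing = [0, 0, 2, 0, 1, 0, 2, 0]
hotness = expert_hotness(routing, 4)
print(hotness)
```

A sequence whose hotness is concentrated on a few experts, as in this toy example, is the kind of biased activation the caption refers to.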