Table of Contents
Fetching ...

Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

Wei Wang, Zhiquan Lai, Shengwei Li, Weijie Liu, Keshi Ge, Ao Shen, Huayou Su, Dongsheng Li

TL;DR

A systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models and achieves up to 2.66x speedup compared to Deepspeed-MoE and FasterMoE and achieves a load-balancing enhancement of up to 11.01x when compared to FasterMoE.

Abstract

The size of deep learning models has been increasing to enhance model quality. The linear increase in training computation budget with model size means that training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Expert (MoE) has drawn significant attention as it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, a considerable dynamic load imbalance occurs among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address the challenge. System-level solutions draw more attention for their hardware affinity and non-disruption of model convergence compared to algorithm-level ones. However, they are troubled by high communication costs and poor communication-computation overlapping. To address these challenges, we propose a systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models. To adapt to the dynamic load imbalance, we profile training statistics and use them to design Pro-Prophet. For lower communication volume, Pro-Prophet planner determines a series of lightweight load-balancing strategies and efficiently searches for a communication-efficient one for training based on the statistics. For sufficient overlapping of communication and computation, Pro-Prophet scheduler schedules the data-dependent operations based on the statistics and operation features, further improving the training throughput. Experimental results indicate that Pro-Prophet achieves up to 2.66x speedup compared to Deepspeed-MoE and FasterMoE. Additionally, Pro-Prophet achieves a load-balancing enhancement of up to 11.01x when compared to FasterMoE.

Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

TL;DR

A systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models and achieves up to 2.66x speedup compared to Deepspeed-MoE and FasterMoE and achieves a load-balancing enhancement of up to 11.01x when compared to FasterMoE.

Abstract

The size of deep learning models has been increasing to enhance model quality. The linear increase in training computation budget with model size means that training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Expert (MoE) has drawn significant attention as it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, a considerable dynamic load imbalance occurs among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address the challenge. System-level solutions draw more attention for their hardware affinity and non-disruption of model convergence compared to algorithm-level ones. However, they are troubled by high communication costs and poor communication-computation overlapping. To address these challenges, we propose a systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models. To adapt to the dynamic load imbalance, we profile training statistics and use them to design Pro-Prophet. For lower communication volume, Pro-Prophet planner determines a series of lightweight load-balancing strategies and efficiently searches for a communication-efficient one for training based on the statistics. For sufficient overlapping of communication and computation, Pro-Prophet scheduler schedules the data-dependent operations based on the statistics and operation features, further improving the training throughput. Experimental results indicate that Pro-Prophet achieves up to 2.66x speedup compared to Deepspeed-MoE and FasterMoE. Additionally, Pro-Prophet achieves a load-balancing enhancement of up to 11.01x when compared to FasterMoE.

Paper Structure

This paper contains 19 sections, 8 equations, 16 figures, 5 tables, 2 algorithms.

Figures (16)

  • Figure 1: The structure of a MoE model and MoE layer. The MoE model consists of both MoE and non-MoE layers stacked on top of each other. The MoE layer consists of a series of experts and a gate network for routing input to experts. For each input, the gate network computes the relationship between the input with three experts and allocates it to the top-$1$ expert for computation.
  • Figure 2: A workflow of Expert Parallelism (EP) in a MoE layer. Following the gate network, batched inputs are first exchanged via an All-to-All (A2A) operation. After the expert computation on all devices, a second A2A operation is used to pass the expert's outputs back to the device where corresponding inputs were originally located.
  • Figure 3: The imbalanced load of experts in an iteration. The model contains 12 MoE layers and each MoE layer contains 16 experts. The vertical axis indicates layer indexes, and the horizontal axis denotes the index of experts. The depth of color represents the proportion of total inputs that an expert handles. Three of the heaviest experts are responsible for over 50% inputs while the three least experts only compute less than 5%.
  • Figure 4: The locality of input distributions. The discrepancies between the different colored curves represent the number of inputs received by each of the different experts. It shows that distributions of adjacent iterations remain relatively constant.
  • Figure 5: The overview of Pro-Prophet. Pro-Prophet is composed of Pro-Prophet planner and Pro-Prophet scheduler. MoE model, locality, and device pool are three inputs of it. Firstly, Pro-Prophet planner searches for a communication-efficient expert placement using its locality-based greedy algorithm. The algorithm iteratively generates and evaluates a lightweight expert placement utilizing its performance model until the load is balanced. Then the execution engine produces a load-balancing workflow based on the planner. Finally, Pro-Prophet scheduler schedules three data-dependent operations to parallel operations for communication and computation overlapping, further improving the training throughput.
  • ...and 11 more figures