LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training

Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, Bin Cui

TL;DR

This work tackles the persistent load imbalance in MoE training caused by dynamic routing. It introduces Fully Sharded Expert Parallelism (FSEP), which fully shards expert parameters across devices and restores complete parameters on demand, enabling flexible in-training re-layout with minimal overhead. Complementing FSEP, a load-balancing planner jointly optimizes per-iteration expert layout and token routing to minimize communication and computation time, while hiding re-layout costs by overlapping them with computation. Experiments on an A100 cluster show speedups of up to $1.69\times$ over state-of-the-art training systems. The approach is modular and compatible with existing systems, offering a scalable path toward efficient trillion-parameter MoE models.

Abstract

Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models: different devices host distinct experts, and each device processes different input data. However, during expert-parallel training, dynamic routing causes significant load imbalance among experts: a handful of overloaded experts slow down the whole iteration and become the training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices and restores partial experts at expert granularity through All-to-All communication during training. This allows expert parameters to be flexibly re-laid out during training to improve load balance. We further apply fine-grained scheduling of communication operations to minimize communication overhead. Additionally, we develop a load-balancing planner that formulates expert re-layout strategies and token routing schemes during training. We run experiments on an A100 cluster, and the results indicate that our system achieves up to $1.69\times$ acceleration compared to current state-of-the-art training systems. Source code is available at https://github.com/PKU-DAIR/Hetu-Galvatron/tree/laer-moe.
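
The sharding-and-restore idea described in the abstract can be illustrated with a toy, single-process sketch. This is not the LAER-MoE implementation: plain Python lists stand in for device memory, a gather loop stands in for the All-to-All collective, and the function names (`shard_experts`, `restore_expert`, `restore_layout`), the layout, and the choice of two restored experts per device are all assumptions made for illustration.

```python
# Illustrative sketch only: lists stand in for device memory, and the gather
# loop stands in for the All-to-All collective used in real training.

def shard_experts(experts, num_devices):
    """Split every expert's flat parameter list into num_devices equal shards.

    Returns shards[d][e]: the slice of expert e that is stored on device d.
    """
    shards = [[] for _ in range(num_devices)]
    for params in experts:
        if len(params) % num_devices != 0:
            raise ValueError("parameter count must divide evenly across devices")
        size = len(params) // num_devices
        for d in range(num_devices):
            shards[d].append(params[d * size:(d + 1) * size])
    return shards


def restore_expert(shards, expert_id):
    """Rebuild one full expert by concatenating its shard from every device."""
    full = []
    for device_shards in shards:
        full.extend(device_shards[expert_id])
    return full


def restore_layout(shards, layout):
    """layout[d] lists the expert ids device d holds in full this iteration."""
    return [{e: restore_expert(shards, e) for e in expert_ids}
            for expert_ids in layout]


# Toy setup (hypothetical sizes): 4 devices, 4 experts, 8 parameters each.
NUM_DEVICES, NUM_EXPERTS, PARAMS_PER_EXPERT = 4, 4, 8
experts = [[e * 100 + i for i in range(PARAMS_PER_EXPERT)]
           for e in range(NUM_EXPERTS)]
shards = shard_experts(experts, NUM_DEVICES)

# Hypothetical layout: expert 0 is overloaded, so three devices restore it.
layout = [[0, 1], [0, 2], [0, 3], [1, 2]]
restored = restore_layout(shards, layout)
```

Because every device keeps only a slice of every expert, changing `layout` between iterations requires no permanent parameter migration in this sketch: a device simply gathers a different set of experts the next time it restores them.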

Paper Structure

This paper contains 36 sections, 2 equations, 12 figures, 4 tables, and 4 algorithms.

Figures (12)

  • Figure 1: (a) Token distribution during training Mixtral 8x7B, showing significant imbalance. (b) Time breakdown, where "default" denotes the profiling result without auxiliary loss, and "balanced" denotes the result when enforcing fully balanced routing.
  • Figure 2: Loss curve with different auxiliary loss weights.
  • Figure 3: The overview of LAER-MoE.
  • Figure 4: Illustration of FSEP, where $N=4$, $E=4$, $C=2$.
  • Figure 5: Communication optimization in FSEP, where A represents Attention layer, M represents MoE layer, S represents Stream, blue blocks on S1 represent forward and backward computation (F and B), yellow blocks on S3 represent All-to-All communication of token dispatcher (A2A), and red blocks on S2 and S4 represent prefetching communication (P) and gradient synchronization (Sy).
  • ...and 7 more figures