Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism

Yuhao Qing, Guichao Zhu, Fanxin Li, Lintian Lei, Zekai Sun, Xiuxian Guan, Shixiong Zhao, Xusheng Chen, Dong Huang, Sen Wang, Heming Cui

TL;DR

The paper addresses a bottleneck in training large MoE-based PTMs: imbalanced expert loads cause straggler effects under expert parallelism. It introduces Fully Sharded Sparse Data Parallelism (FSSDP), which fully shards MoE parameters and optimizer states and uses SparseAllGather and SparseReduceScatter to materialize dynamic expert placements in each iteration, removing explicit rearrangement from the critical path. Building on FSSDP, the Hecate system implements heterogeneous sharding, sparse materialization, and topology-aware token dispatching to achieve high throughput with low memory overhead. Empirical results across multiple models and clusters show speedups of up to 3.54x over state-of-the-art baselines, along with improved memory efficiency and consistent gains across configurations.

Abstract

Mixture-of-Experts (MoE) has emerged as a promising sparse paradigm for scaling up pre-trained models (PTMs) with remarkable cost-effectiveness. However, the dynamic nature of MoE leads to rapid fluctuations and imbalances in expert loads during training, resulting in significant straggler effects that hinder training performance when using expert parallelism (EP). Existing MoE training systems attempt to mitigate these effects through expert rearrangement strategies, but they face challenges in memory efficiency and in the timeliness of rearrangement. This paper proposes Fully Sharded Sparse Data Parallelism (FSSDP), an approach that tackles the parallelization of MoE layers and the straggler effects caused by imbalanced expert loads from a new perspective. FSSDP fully shards the parameters and optimizer states of MoE layers across devices and sparsely materializes MoE parameters from scratch in each iteration with two sparse collectives, SparseAllGather and SparseReduceScatter. We build Hecate, a high-performance MoE training system that incorporates FSSDP to fully unlock its potential. Hecate introduces heterogeneous sharding, sparse materialization, and re-materialization techniques to construct flexible and efficient expert placements with low memory and communication overhead. Our evaluation shows that Hecate achieves up to 3.54x speedup over state-of-the-art MoE training systems and consistently demonstrates improvements across model architectures and hardware environments.
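
As a rough illustration of the sparse materialization idea described in the abstract, the following single-process Python sketch simulates the two collectives with plain dictionaries. It is not Hecate's implementation: the function names `sparse_all_gather` and `sparse_reduce_scatter`, the one-owner-per-expert sharding, and the hand-written placement are all assumptions made for the example, and no real communication takes place.

```python
from collections import defaultdict

# Hypothetical sharded state: expert_id -> (owner_device, parameter list).
# Assumption: each expert is owned whole by one device.
shards = {
    0: (0, [1.0, 2.0]),
    1: (0, [3.0, 4.0]),
    2: (1, [5.0, 6.0]),
    3: (1, [7.0, 8.0]),
}

def sparse_all_gather(shards, placement):
    """Give each device copies of only the experts it was assigned.

    placement: device -> list of expert ids to materialize this iteration.
    A dense all-gather would instead copy every expert to every device.
    """
    return {
        device: {e: list(shards[e][1]) for e in expert_ids}
        for device, expert_ids in placement.items()
    }

def sparse_reduce_scatter(local_grads, shards):
    """Sum each expert's gradients over its replicas and hand the result
    to the device that owns that expert's shard."""
    reduced = defaultdict(dict)
    for device, grads in local_grads.items():
        for e, g in grads.items():
            owner = shards[e][0]
            if e in reduced[owner]:
                reduced[owner][e] = [a + b for a, b in zip(reduced[owner][e], g)]
            else:
                reduced[owner][e] = list(g)
    return dict(reduced)

# Assumed placement for one iteration: expert 0 is "hot", so both devices get it.
placement = {0: [0, 1], 1: [0, 2, 3]}
materialized = sparse_all_gather(shards, placement)

# Made-up per-device gradients for the experts each device materialized.
local_grads = {
    0: {0: [1.0, 1.0], 1: [2.0, 2.0]},
    1: {0: [3.0, 3.0], 2: [4.0, 4.0], 3: [5.0, 5.0]},
}
reduced = sparse_reduce_scatter(local_grads, shards)
```

Here expert 0 is assumed to be overloaded, so it is materialized on both devices, while its summed gradient is returned only to the device that owns its shard.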

Paper Structure

This paper contains 21 sections, 4 equations, 15 figures, 1 table, 2 algorithms.

Figures (15)

  • Figure 1: MoE parallel training strategies for a single Transformer-MoE layer. The ellipses between All-to-All represent other layers in the model. (d) Rearrangement adjusts expert placements to mitigate straggler effects of EP. The red bars under experts are the workloads of devices. (b) Rearrangement systems have lower MoE computation and All-to-All latency than EP, but introduce rearrangement overhead in the performance-critical path. (e) FSSDP achieves the same balanced placement as rearrangement per iteration using two sparse collectives, while avoiding rearrangement overheads between iterations from (c). The dashed-line SparseAllGather box re-materializes parameters for the following backward computation.
  • Figure 2: Illustrations of an MoE layer with a top-$2$ gate.
  • Figure 3: Expert load distribution during training, with colors indicating token proportions per expert.
  • Figure 4: FSDP vs. FSSDP (on a single MoE layer)
  • Figure 5: Workflow of FSSDP at MoE layer $l$ in an iteration. $E^{l}_{i}$ represents expert $i$ of MoE layer $l$ in the PTM. The sharding phase partitions the MoE layer's parameters and optimizer states into MoE shards placed across devices. The materialization phase handles the sparse data parallelism with two novel collectives, SparseAllGather and SparseReduceScatter.
  • ...and 10 more figures
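
To make the top-2 gate of Figure 2 and the per-expert token proportions of Figure 3 concrete, here is a minimal pure-Python sketch. The gate logits are made-up numbers, the helper name `top2_gate` is hypothetical, and renormalizing the two selected probabilities is one common convention assumed here rather than a detail taken from the paper.

```python
import math
from collections import Counter

def top2_gate(logits):
    """Return [(expert_id, weight), (expert_id, weight)] for one token.

    Softmax over the gate logits, keep the two most probable experts,
    and renormalize their weights to sum to 1 (an assumed convention).
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [x / total for x in exps]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]

# Made-up gate logits for four tokens over four experts.
token_logits = [
    [2.0, 1.0, 0.0, -1.0],
    [1.5, 2.5, 0.0, 0.0],
    [3.0, 0.0, 1.0, 0.5],
    [0.0, 0.5, 2.0, 1.0],
]

# Count how many token assignments each expert receives.
loads = Counter()
for logits in token_logits:
    for expert, _weight in top2_gate(logits):
        loads[expert] += 1

# Fraction of all assignments handled by each expert.
total_assignments = sum(loads.values())
proportions = {e: loads[e] / total_assignments for e in sorted(loads)}
```

Even in this tiny example the assignments are uneven across experts, which is the kind of load imbalance that causes stragglers when each expert is pinned to a single device.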