ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training

Xueze Kang, Guangyu Xiang, Yuxin Wang, Hao Zhang, Yuchu Fang, Yuhang Zhou, Zhenheng Tang, Youhui Lv, Eliran Maman, Mark Wasserman, Alon Zameret, Zhipeng Bian, Shushu Chen, Zhiyou Yu, Jin Wang, Xiaoyu Wu, Yang Zheng, Chen Tian, Xiaowen Chu

TL;DR

ElasWave tackles the challenge of elastic-native training at hyperscale by introducing multi-dimensional scheduling across dataflow, computation graph, device frequency, and RNG to achieve per-step fault tolerance while maintaining parameter and computation consistency. It couples an in-place Dynamic Communicator with non-blocking layer migration, interleaved ZeRO state movement, and per-step in-memory snapshots to minimize MTTR and preserve convergence under disruptions. Key contributions include a dataflow/graph resharding strategy with a minimax layer-partition DP, DVFS-based bubble removal, RNG resharding for deterministic randomness, and live remapping via a Snapshot Pool, all orchestrated by a four-axis scheduler and Recovery Executor. On a 96-NPU testbed, ElasWave improves throughput by up to $1.60\times$ over TorchFT and $1.35\times$ over ReCycle, completes communicator recovery within one second, cuts migration MTTR by as much as $51\%$, and reduces convergence deviation by approximately $78\%$, which suggests practical value for large-model pretraining on clusters whose resources fluctuate.
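
The minimax layer-partition DP mentioned above can be illustrated with the textbook contiguous-partition recurrence: split a sequence of per-layer costs into a fixed number of pipeline stages so that the slowest stage is as fast as possible. The sketch below is a minimal illustration under stated assumptions, not the paper's planner. The name `minimax_partition`, the uniform-device cost model, and the example costs are hypothetical, and any memory or migration terms that ElasWave's own formulation may include are omitted.

```python
from itertools import accumulate


def minimax_partition(costs, stages):
    """Split `costs` into `stages` contiguous groups, minimizing the largest group sum.

    Hypothetical helper for illustration only. Returns (bottleneck, bounds),
    where bounds is a list of half-open (start, end) layer ranges per stage.
    """
    n = len(costs)
    if not 1 <= stages <= n:
        raise ValueError("need 1 <= stages <= number of layers")

    prefix = [0, *accumulate(costs)]  # prefix[i] = total cost of layers [0, i)
    inf = float("inf")

    # best[k][i]: smallest achievable bottleneck when the first i layers use k stages
    best = [[inf] * (n + 1) for _ in range(stages + 1)]
    cut = [[0] * (n + 1) for _ in range(stages + 1)]
    best[0][0] = 0

    for k in range(1, stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last stage holds layers [j, i)
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the stage boundaries.
    bounds = []
    end = n
    for k in range(stages, 0, -1):
        start = cut[k][end]
        bounds.append((start, end))
        end = start
    bounds.reverse()
    return best[stages][n], bounds
```

For example, `minimax_partition([1, 2, 3, 4, 5], 2)` places the first three layers on one stage and the last two on the other, giving a bottleneck cost of 9. The cubic-time triple loop is kept for readability; it is adequate for the layer counts of a single pipeline.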

Abstract

Large-scale LLM pretraining now runs across $10^5$--$10^6$ accelerators, making failures routine and elasticity mandatory. We posit that an elastic-native training system must jointly deliver (i) parameter consistency, (ii) low mean time to recovery (MTTR), (iii) high post-change throughput, and (iv) computation consistency. No prior system achieves all four simultaneously. To achieve these goals, we present ElasWave, which delivers per-step fault tolerance via multi-dimensional scheduling across graph, dataflow, DVFS, and RNG. ElasWave reshapes and reshards micro-batches while preserving the global batch size and gradient scale. It performs online pipeline resharding with asynchronous parameter migration and interleaves ZeRO partitions, reducing parameter recovery processes to disjoint rank-to-rank transfers. It further leverages DVFS to absorb pipeline bubbles and reshards RNG to keep computation consistency. Together, a dynamic communicator enables in-place communication group edits, while per-step in-memory snapshots support online verification and redistribution. We evaluate ElasWave on 96 NPUs and benchmark it against state-of-the-art baselines: throughput improves by $1.35\times$ over ReCycle and $1.60\times$ over TorchFT; communicator recovery completes within one second (up to $82\times/3.6\times$ faster than full/partial rebuilds); migration MTTR drops by as much as $51\%$; and convergence deviation is reduced by approximately $78\%$.
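
To make "preserving the global batch size and gradient scale" concrete, the following sketch redistributes a fixed global batch over however many data-parallel ranks survive, then combines per-rank mean gradients with weights proportional to each rank's share. It is a simplified model under stated assumptions: `reshard_batch` and `combine_gradients` are hypothetical names, gradients are scalars, and the even split ignores the per-rank capacity differences a real scheduler would account for.

```python
def reshard_batch(global_batch, n_ranks):
    """Split `global_batch` samples over `n_ranks` as evenly as possible.

    Hypothetical helper: the shares always sum to `global_batch`, so the
    global batch size is unchanged when the number of ranks changes.
    """
    if n_ranks <= 0:
        raise ValueError("at least one rank must survive")
    base, extra = divmod(global_batch, n_ranks)
    # The first `extra` ranks take one additional sample each.
    return [base + (1 if rank < extra else 0) for rank in range(n_ranks)]


def combine_gradients(rank_mean_grads, shares):
    """Weight each rank's mean gradient by its share of the global batch.

    With uneven shares, a plain average over ranks would over-weight the
    smaller ranks; weighting by share recovers the global per-sample mean.
    """
    total = sum(shares)
    return sum(g * s / total for g, s in zip(rank_mean_grads, shares))
```

With a global batch of 64 and one of four ranks lost, `reshard_batch(64, 3)` yields shares of 22, 21, and 21 samples. Combining the three per-rank means with those weights gives the same value as averaging all 64 per-sample gradients directly, which is the sense in which the gradient scale is preserved.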

Paper Structure

This paper contains 29 sections, 3 equations, 15 figures, 4 tables, 2 algorithms.

Figures (15)

  • Figure 1: Pipeline schedule for ReCycle after a failure at node (1,2). ReCycle reroutes the failed rank's work to peers in the same stage (e.g., (0,2) and (2,2)), creating a straggler. While its decoupled backward pass creates bubbles to absorb the extra work, the large number of rerouted micro-batches quickly exhausts this bubble budget. The strategy also extends activation lifetimes, which increases memory pressure and risks Out-of-Memory (OOM) errors.
  • Figure 2: System architecture of ElasWave, illustrating the elastic recovery workflow and its triggers (fail-stop, fail-slow, and resource-scheduling signals). (a) When the ElasWave Agent detects a failure, a straggler, or a scheduling signal, it reports the current cluster state to the ElasWave Core. The Core then generates a multi-dim plan that optimizes for four key goals. (b) The Engine first pauses the training job via the Recovery Executor. (c) The recovery plan is dispatched to the Recovery Executor. The Executor uses the plan to perform an accelerated live remap, reconfigure links, and set the new dataflow, using state provided by the Parameter Fabric from the in-memory Snapshot Pool. Once the cluster is reconfigured, training resumes.
  • Figure 3: To optimize throughput, ElasWave's multi-dim scheduling combines three strategies. After a failure, it first performs Data Reshard in the DP domain (①), then uses Pipeline Reshard in the PP domain (②) to balance the workload, and finally eliminates the remaining pipeline bubbles with DVFS (③).
  • Figure 4: Comparison of pipeline schedule examples. The steady phase illustrates how multi-dim scheduling in (ii) and (iii) avoids the straggler and OOM issues of data rerouting in (i), achieving efficient and reliable training states step by step. The total execution times ($T_i$, $T_{ii}$, $T_{iii}$) show a progressive reduction in pipeline completion time.
  • Figure 5: Elastic training without (b) or with (c) RNG Resharding. In (b), $L_1$ in $PP_0$ is transferred to $PP_1$, and the RNG states $R_{0,1}^0$ and $R_{1,1}^0$ in $PP_1$ are applied directly to $L_1$, which introduces inconsistency. In addition, Data 1 is allocated to $DP_0$ but is processed with RNG state $R_{0,0}^0$, which is also inconsistent with (a). In (c), RNG state $R_{1,0}^0$ is also saved in $DP_0$ and applied to Data 1. After processing $L_0$, $R_{0,0}^1$ and $R_{0,1}^1$ are sent to $PP_1$ of $DP_0$ and $DP_1$, respectively, and are used to process the transferred $L_1$. Therefore, (c) achieves convergence behavior similar to (a).
  • ...and 10 more figures
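
The idea behind the RNG Resharding caption for Figure 5 can be sketched with Python's standard `random.Random`: when a layer moves to another pipeline stage, the generator state that would have been used for it travels with it, so the random draws (here, a toy dropout mask) match the failure-free run. This is a single-process toy under stated assumptions, not ElasWave's implementation: the function names, the seeds, and the boolean-mask dropout are all hypothetical, and a real system would move accelerator RNG state between ranks rather than Python objects.

```python
import random


def dropout_mask(rng, width, p=0.5):
    """Draw a toy dropout mask (True = keep) from the given generator."""
    return [rng.random() >= p for _ in range(width)]


def baseline(seed, width):
    """Failure-free run: L0 and L1 sit on the same stage and share one RNG stream."""
    rng = random.Random(seed)
    mask_l0 = dropout_mask(rng, width)
    mask_l1 = dropout_mask(rng, width)
    return mask_l0, mask_l1


def migrated_naive(seed, dest_seed, width):
    """L1 moved to another stage and simply uses that stage's own RNG state."""
    src = random.Random(seed)
    dest = random.Random(dest_seed)
    mask_l0 = dropout_mask(src, width)
    mask_l1 = dropout_mask(dest, width)  # draws differ from the baseline
    return mask_l0, mask_l1


def migrated_resharded(seed, dest_seed, width):
    """L1 moved, but the source stage's post-L0 RNG state is carried along."""
    src = random.Random(seed)
    dest = random.Random(dest_seed)
    mask_l0 = dropout_mask(src, width)

    carried = src.getstate()      # state after L0, "sent" to the new stage
    own_state = dest.getstate()   # keep the destination's stream for its own layers
    dest.setstate(carried)
    mask_l1 = dropout_mask(dest, width)
    dest.setstate(own_state)      # restore so the destination's layers are unaffected
    return mask_l0, mask_l1
```

Here `migrated_resharded(0, 1, 64)` returns exactly the masks of `baseline(0, 64)`, whereas `migrated_naive(0, 1, 64)` agrees only on the first layer's mask. Saving and restoring the destination's own state reflects the point that the migrated layer should not disturb the randomness of layers already resident on that stage.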