ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training

Xueze Kang; Guangyu Xiang; Yuxin Wang; Hao Zhang; Yuchu Fang; Yuhang Zhou; Zhenheng Tang; Youhui Lv; Eliran Maman; Mark Wasserman; Alon Zameret; Zhipeng Bian; Shushu Chen; Zhiyou Yu; Jin Wang; Xiaoyu Wu; Yang Zheng; Chen Tian; Xiaowen Chu

TL;DR

ElasWave targets elastic-native training at hyperscale. It schedules across four dimensions (dataflow, computation graph, device frequency via DVFS, and RNG) to deliver per-step fault tolerance while maintaining parameter and computation consistency. It couples an in-place Dynamic Communicator with non-blocking layer migration, interleaved ZeRO state movement, and per-step in-memory snapshots to minimize MTTR and preserve convergence under disruptions. Key contributions include a dataflow/graph resharding strategy with a minimax layer-partition DP, DVFS-based bubble removal, RNG resharding for deterministic randomness, and live remapping via a Snapshot Pool, all orchestrated by a four-axis scheduler and Recovery Executor. Empirically, ElasWave achieves up to $1.60\times$ higher throughput than TorchFT and $1.35\times$ higher than ReCycle, completes communicator recovery in under one second, and reduces convergence deviation by approximately $78\%$, which suggests it is practical for large-model pretraining on hyperscale clusters whose membership fluctuates.
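To make the "minimax layer-partition" idea concrete, here is a minimal sketch of the textbook linear-partition dynamic program: it splits a sequence of layers into contiguous pipeline stages so that the most expensive stage is as cheap as possible. It assumes that "DP" in the summary means dynamic programming and that each layer has a single scalar cost; the function name `minimax_partition` and the cost values are hypothetical, and the paper's actual formulation may also account for memory limits, migration cost, or heterogeneous devices.

```python
from itertools import accumulate


def minimax_partition(layer_costs, num_stages):
    """Split layers into contiguous stages, minimizing the costliest stage.

    Returns (bottleneck_cost, [(start, end), ...]) with half-open layer ranges.
    Illustrative only: a single scalar cost per layer is an assumption.
    """
    n = len(layer_costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need 1 <= num_stages <= number of layers")

    # prefix[i] is the total cost of layers [0, i).
    prefix = [0, *accumulate(layer_costs)]
    inf = float("inf")

    # best[k][i]: smallest achievable bottleneck for the first i layers in k stages.
    # cut[k][i]: where the last stage starts in that optimal split.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0

    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # Last stage covers layers [j, i); earlier stages cover [0, j).
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the cut table backwards to recover the stage boundaries.
    bounds = []
    end = n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        bounds.append((start, end))
        end = start
    bounds.reverse()
    return best[num_stages][n], bounds


if __name__ == "__main__":
    costs = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical per-layer step times
    bottleneck, stages = minimax_partition(costs, 3)
    print(bottleneck, stages)
```

After a device loss, a system could rerun such a partition with one fewer stage, or with updated per-layer costs, to choose which layers to migrate. This version runs in O(stages × layers²) time, which is small for typical layer counts.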

Abstract

Large-scale LLM pretraining now runs across $10^5$--$10^6$ accelerators, making failures routine and elasticity mandatory. We posit that an elastic-native training system must jointly deliver (i) parameter consistency, (ii) low mean time to recovery (MTTR), (iii) high post-change throughput, and (iv) computation consistency. No prior system achieves all four simultaneously. To achieve these goals, we present ElasWave, which delivers per-step fault tolerance via multi-dimensional scheduling across graph, dataflow, DVFS, and RNG. ElasWave reshapes and reshards micro-batches while preserving the global batch size and gradient scale. It performs online pipeline resharding with asynchronous parameter migration and interleaves ZeRO partitions, reducing parameter recovery processes to disjoint rank-to-rank transfers. It further leverages DVFS to absorb pipeline bubbles and reshards RNG to keep computation consistency. Together, a dynamic communicator enables in-place communication group edits, while per-step in-memory snapshots support online verification and redistribution. We evaluate ElasWave on 96 NPUs and benchmark it against state-of-the-art baselines: throughput improves by $1.35\times$ over ReCycle and $1.60\times$ over TorchFT; communicator recovery completes within one second (up to $82\times/3.6\times$ faster than full/partial rebuilds); migration MTTR drops by as much as $51\%$; and convergence deviation is reduced by approximately $78\%$.
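The statement that micro-batches are resharded "while preserving the global batch size and gradient scale" can be illustrated with a small sketch. It assumes a fixed number of equally sized micro-batches per step and a simple even redistribution over the surviving data-parallel ranks; the helper names `reshard_micro_batches` and `gradient_weights` are hypothetical and are not ElasWave's API.

```python
def reshard_micro_batches(num_micro_batches, live_ranks):
    """Spread a fixed number of micro-batches over the surviving ranks.

    The total is unchanged, so the global batch size is preserved.
    Even redistribution is an assumption made for illustration.
    """
    if not live_ranks:
        raise ValueError("at least one live rank is required")
    base, extra = divmod(num_micro_batches, len(live_ranks))
    # The first `extra` ranks each take one additional micro-batch.
    return {
        rank: base + (1 if idx < extra else 0)
        for idx, rank in enumerate(live_ranks)
    }


def gradient_weights(plan):
    """Weight each rank's mean gradient by its share of the global batch.

    With these weights, the weighted sum of per-rank mean gradients equals
    the mean over all micro-batches, so the gradient scale is unchanged
    even when ranks process unequal numbers of micro-batches.
    """
    total = sum(plan.values())
    return {rank: count / total for rank, count in plan.items()}


if __name__ == "__main__":
    # 16 micro-batches per step; rank 2 of {0, 1, 2, 3} has just failed.
    plan = reshard_micro_batches(16, [0, 1, 3])
    weights = gradient_weights(plan)
    print(plan)
    print(weights)
```

A plain unweighted average across ranks would over-count the ranks holding fewer micro-batches, which is why the weights follow the micro-batch counts rather than being uniform.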
