TrainMover: An Interruption-Resilient and Reliable ML Training Runtime
ChonLam Lao, Minlan Yu, Aditya Akella, Jiamin Cao, Yu Guan, Pengcheng Zhang, Zhilong Zheng, Yichi Xu, Ennan Zhai, Dennis Cai, Jiaqi Gao
TL;DR
TrainMover tackles the downtime and inefficiency that interruptions cause in large-scale LLM training by combining a two-phase, delta-based NCCL communication group setup with a communication-free sandbox that runs shadow iterations. It uses standby machines to migrate work without restarting the training job or altering its parallelization strategy, and it overlaps preparation with ongoing execution so that migration adds no memory overhead. Empirical results show sub-10-second downtime and up to ~70× speedups over checkpoint-based baselines, with 99% training efficiency under periodic rebalancing and robust recovery from unexpected failures. The approach promises practical benefits for large-scale, tightly coupled distributed training in dynamic data-center environments.
Abstract
Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions such as checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setup and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency when rebalancing periodically every 10 minutes. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
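
As a rough sanity check on how the 99% efficiency figure relates to second-level downtime, the sketch below assumes that the only cost of a rebalancing event is its downtime and that one event occurs per interval. Both are simplifying assumptions made here for illustration, and the function names are hypothetical rather than part of TrainMover.

```python
def training_efficiency(downtime_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent training if each interval loses
    downtime_s seconds (assumption: downtime is the only overhead)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if not 0 <= downtime_s <= interval_s:
        raise ValueError("downtime_s must lie in [0, interval_s]")
    return 1.0 - downtime_s / interval_s


def max_downtime_for(efficiency: float, interval_s: float) -> float:
    """Largest per-interval downtime (seconds) compatible with a target efficiency."""
    if not 0 <= efficiency <= 1:
        raise ValueError("efficiency must lie in [0, 1]")
    return (1.0 - efficiency) * interval_s


if __name__ == "__main__":
    interval_s = 10 * 60  # periodic 10-minute rebalancing
    print(max_downtime_for(0.99, interval_s))
    print(training_efficiency(6.0, interval_s))
```

Under these assumptions, a 10-minute (600 s) rebalancing interval leaves a budget of about 6 s of downtime per event at 99% efficiency, which is in line with the second-level downtime reported above.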
