TrainMover: An Interruption-Resilient and Reliable ML Training Runtime

ChonLam Lao, Minlan Yu, Aditya Akella, Jiamin Cao, Yu Guan, Pengcheng Zhang, Zhilong Zheng, Yichi Xu, Ennan Zhai, Dennis Cai, Jiaqi Gao

TL;DR

TrainMover targets the downtime and inefficiency that interruptions cause in large-scale LLM training. It combines a two-phase, delta-based NCCL setup with a communication-free sandbox that runs shadow iterations. Standby machines take over migrated work without restarting training or altering the parallelization strategy, and preparation overlaps with ongoing execution so that migration adds zero memory overhead. Empirical results show sub-10-second downtime and speedups of up to roughly 70× over checkpoint-based baselines, with 99% training throughput during frequent rebalancing and robust recovery from unexpected failures. The approach promises practical benefits for large-scale, tightly coupled distributed training in dynamic data-center environments.

Abstract

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
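
To make the relationship between rebalancing period, downtime, and training efficiency concrete, here is a minimal back-of-the-envelope sketch. It assumes a deliberately simple model that is not taken from the paper: efficiency = 1 - downtime / period, with downtime as the only source of lost work. The function names are hypothetical and chosen only for illustration.

```python
def downtime_budget_seconds(period_s: float, efficiency: float) -> float:
    """Largest downtime per period that still meets the target efficiency.

    Assumed model (not from the paper): efficiency = 1 - downtime / period.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return period_s * (1.0 - efficiency)


def efficiency_from_downtime(period_s: float, downtime_s: float) -> float:
    """Efficiency implied by a given downtime per period, under the same model."""
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    return 1.0 - downtime_s / period_s


# A 10-minute (600 s) rebalancing period at 99% efficiency leaves
# about 6 seconds of downtime per period under this model.
budget = downtime_budget_seconds(600.0, 0.99)
print(f"downtime budget: {budget:.1f} s per period")
```

Under this simplified model, 99% efficiency with a 10-minute period corresponds to roughly 6 seconds of downtime per migration, which is in line with the second-level downtime the abstract reports. The paper's own efficiency metric may account for additional overheads.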

Paper Structure

This paper contains 26 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Cluster status of two training jobs at different scales. Slowness is defined as a 10% iteration time increase.
  • Figure 2: Training efficiency comparison with GPU/Network changes every 10 minutes
  • Figure 3: TrainMover Overview
  • Figure 4: Time breakdown of NCCL setup components with and without the CUDA_VISIBLE_DEVICES flag.
  • Figure 5: CCL Migration workflow
  • ...and 11 more figures