ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis
TL;DR
ReCycle presents a fault-tolerant training system for large DNNs that dispenses with spare GPUs by exploiting functional redundancy across pipelines and the idle bubbles in pipeline schedules. It combines Adaptive Pipelining, Decoupled BackProp, and a Staggered Optimizer, coordinated by a Planner that uses Failure Normalization and MILP-based schedule generation to handle multiple failures and re-joins with minimal throughput degradation. Across real hardware and simulations, ReCycle outperforms prior fault-tolerance approaches (Bamboo and Oobleck), achieving up to 1.64× higher throughput and sustaining training with significant failure rates while efficiently leveraging memory. The key contribution is a practical, planner-guided, bubble-aware approach that preserves convergence and enables scalable training of trillion-parameter models under dynamic resource availability.
Abstract
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can significantly reduce training throughput. Utilizing spare GPU servers to mitigate performance loss becomes increasingly costly as model sizes grow. ReCycle is a system designed for efficient DNN training in the presence of failures, without relying on spare servers. It exploits the inherent functional redundancy in distributed training systems -- where servers across data-parallel groups store the same model parameters -- and pipeline schedule bubbles within each data-parallel group. When servers fail, ReCycle dynamically re-routes micro-batches to data-parallel peers, allowing for uninterrupted training despite multiple failures. However, this re-routing can create imbalances across pipeline stages, leading to reduced training throughput. To address this, ReCycle introduces two key optimizations that ensure re-routed micro-batches are processed within the original pipeline schedule's bubbles. First, it decouples the backward pass into two phases: one for computing gradients for the input and another for calculating gradients for the parameters. Second, it avoids synchronization across pipeline stages by staggering the optimizer step. Together, these optimizations enable adaptive pipeline schedules that minimize or even eliminate training throughput degradation during failures. We describe a prototype for ReCycle and show that it achieves high training throughput under multiple failures, outperforming recent proposals for fault-tolerant training such as Oobleck and Bamboo by up to $1.46\times$ and $1.64\times$, respectively.
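
To make the re-routing idea concrete, the following is a minimal sketch, not ReCycle's actual Planner. It assumes a grid of workers indexed by (pipeline, stage), where all workers at the same stage hold identical parameters, and it spreads the micro-batches of a failed worker over its healthy data-parallel peers in simple round-robin order. The function name `reroute_microbatches` and its parameters are hypothetical, and the round-robin policy is an illustrative stand-in for the schedule generation described in the paper.

```python
from collections import defaultdict


def reroute_microbatches(num_pipelines, num_stages, microbatches_per_pipeline, failed):
    """Assign every (origin_pipeline, microbatch) to a healthy worker per stage.

    `failed` is a set of (pipeline, stage) workers that are down.
    Returns {(pipeline, stage): [(origin_pipeline, microbatch_id), ...]}.
    Hypothetical helper: round-robin is an assumption, not ReCycle's policy.
    """
    assignment = defaultdict(list)
    for stage in range(num_stages):
        # Data-parallel peers at this stage store the same parameters,
        # so any healthy one can process a re-routed micro-batch.
        healthy = [p for p in range(num_pipelines) if (p, stage) not in failed]
        if not healthy:
            raise RuntimeError(f"no healthy replica left for stage {stage}")
        next_peer = 0
        for pipeline in range(num_pipelines):
            for mb in range(microbatches_per_pipeline):
                if (pipeline, stage) not in failed:
                    owner = pipeline  # normal case: stay in the home pipeline
                else:
                    owner = healthy[next_peer % len(healthy)]
                    next_peer += 1
                assignment[(owner, stage)].append((pipeline, mb))
    return dict(assignment)


# Example: 3 pipelines, 2 stages, 4 micro-batches each; worker (1, 1) has failed.
plan = reroute_microbatches(3, 2, 4, failed={(1, 1)})
for worker in sorted(plan):
    print(worker, len(plan[worker]), "micro-batches")
```

In this example the two healthy peers at stage 1 each take on two extra micro-batches, which illustrates the load imbalance across stages that the abstract says the two optimizations are designed to absorb into pipeline bubbles.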
