ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis

TL;DR

ReCycle is a fault-tolerant training system for large DNNs that does not rely on spare GPU servers. Instead, it exploits the functional redundancy across data-parallel pipelines and the idle bubbles in pipeline schedules. It combines Adaptive Pipelining, Decoupled BackProp, and a Staggered Optimizer, coordinated by a Planner that uses Failure Normalization and MILP-based schedule generation to handle multiple failures and re-joins with minimal throughput degradation. Across real hardware and simulations, ReCycle outperforms prior fault-tolerance approaches, achieving up to 1.46× higher throughput than Oobleck and up to 1.64× higher throughput than Bamboo, and it sustains training under multiple failures. The key contribution is a practical, planner-guided, bubble-aware approach that preserves convergence and enables scalable training of trillion-parameter models under dynamic resource availability.

Abstract

Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can have a significant impact on training throughput. Utilizing spare GPU servers to mitigate performance loss becomes increasingly costly as model sizes grow. ReCycle is a system designed for efficient DNN training in the presence of failures, without relying on spare servers. It exploits the inherent functional redundancy in distributed training systems -- where servers across data-parallel groups store the same model parameters -- and pipeline schedule bubbles within each data-parallel group. When servers fail, ReCycle dynamically re-routes micro-batches to data-parallel peers, allowing for uninterrupted training despite multiple failures. However, this re-routing can create imbalances across pipeline stages, leading to reduced training throughput. To address this, ReCycle introduces two key optimizations that ensure re-routed micro-batches are processed within the original pipeline schedule's bubbles. First, it decouples the backward pass into two phases: one for computing gradients for the input and another for calculating gradients for the parameters. Second, it avoids synchronization across pipeline stages by staggering the optimizer step. Together, these optimizations enable adaptive pipeline schedules that minimize or even eliminate training throughput degradation during failures. We describe a prototype for ReCycle and show that it achieves high training throughput under multiple failures, outperforming recent proposals for fault-tolerant training such as Oobleck and Bamboo by up to $1.46\times$ and $1.64\times$, respectively.
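The decoupling of the backward pass described above can be illustrated on a single linear operator. The following is a minimal sketch, not ReCycle's implementation: it assumes an operator $y = Wx$, and the function names `backward_input` and `backward_weight` as well as the `deferred` list are hypothetical. The input gradient is needed by the previous pipeline stage right away, while the parameter gradient depends only on values already held locally, so it can be postponed to a later idle slot.

```python
# Hypothetical sketch of a decoupled backward pass for y = W x (pure Python).
# The names below are illustrative and do not come from the ReCycle prototype.

def forward(W, x):
    # y_i = sum_j W[i][j] * x[j]
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def backward_input(W, grad_y):
    # Phase 1: gradient w.r.t. the input, dL/dx = W^T grad_y.
    # The upstream pipeline stage is waiting for this value.
    return [sum(W[i][j] * grad_y[i] for i in range(len(W)))
            for j in range(len(W[0]))]

def backward_weight(x, grad_y):
    # Phase 2: gradient w.r.t. the parameters, dL/dW = outer(grad_y, x).
    # Nothing downstream or upstream depends on it until the optimizer step.
    return [[g * xj for xj in x] for g in grad_y]

W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]
grad_y = [1.0, 2.0]  # gradient arriving from the next stage

# Phase 1 runs immediately so the previous stage can continue.
grad_x = backward_input(W, grad_y)

# Phase 2 is stashed and executed later (assumed here: in a schedule bubble).
deferred = [(x, grad_y)]
grad_W = [backward_weight(xs, gs) for xs, gs in deferred][0]

print(grad_x)
print(grad_W)
```

The trade-off in this sketch is memory: the stashed `(x, grad_y)` pairs must be kept alive until the deferred phase runs.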

Paper Structure

This paper contains 31 sections, 7 equations, 13 figures, 2 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Illustration of hybrid parallelism. Pipeline stages are denoted with different colors. Within each pipeline stage, operators are partitioned through tensor parallelism. The global batch is split into micro-batches across pipelines.
  • Figure 2: Adaptive pipelining when $W_{1\_2}$ fails. The micro-batches from worker $W_{1\_1}$, originally intended for $W_{1\_2}$, are dynamically re-routed to workers $W_{0\_2}$ and $W_{2\_2}$, ensuring that the training process continues without interruption.
  • Figure 3: Hybrid-parallel training across 12 workers with 3 data-parallel pipelines, each with 4 pipeline stages, and 6 micro-batches. The fault-free schedule panel shows a 1F1B schedule. The adaptive schedule panel shows how Adaptive Pipelining re-routes micro-batches from failed worker $W_{1\_2}$ to its functional peers $W_{0\_2}$ and $W_{2\_2}$.
  • Figure 4: Forward and Backward pass for an operator.
  • Figure 5: Optimized schedule with Decoupled BackProp when worker $W_{1\_2}$ fails.
  • ...and 8 more figures
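
The re-routing shown in Figures 2 and 3 can be sketched for the same configuration: 3 data-parallel pipelines, 6 micro-batches per pipeline, and a failure of the worker at pipeline 1, stage 2. This is an illustration under stated assumptions only. The round-robin assignment and the function name `reroute_stage` are hypothetical; in ReCycle the actual schedule is produced by the Planner, which this sketch does not model.

```python
# Hypothetical sketch: redistribute a failed worker's micro-batches to its
# data-parallel peers at the same pipeline stage. Round-robin is an assumed
# policy chosen for simplicity, not the Planner's MILP-based schedule.

def reroute_stage(num_pipelines, num_microbatches, failed_pipeline, stage):
    # Peers hold the same stage parameters in the other pipelines.
    peers = [p for p in range(num_pipelines) if p != failed_pipeline]
    if not peers:
        raise RuntimeError("no functional peer left for this stage")

    # Each peer keeps its own micro-batches, tagged (source_pipeline, index).
    load = {(p, stage): [(p, mb) for mb in range(num_microbatches)]
            for p in peers}

    # Spread the failed worker's micro-batches across the peers.
    for mb in range(num_microbatches):
        target = peers[mb % len(peers)]
        load[(target, stage)].append((failed_pipeline, mb))
    return load

plan = reroute_stage(num_pipelines=3, num_microbatches=6,
                     failed_pipeline=1, stage=2)
for (pipeline, stage), work in plan.items():
    print(f"W_{pipeline}_{stage}: {len(work)} micro-batches")
```

In this configuration each surviving peer at stage 2 processes its own 6 micro-batches plus 3 re-routed ones, which is the stage imbalance that the abstract says must be absorbed by schedule bubbles.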