SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

Jin Lee; Zhonghao Chen; Xuhang He; Robert Underwood; Bogdan Nicolae; Franck Cappello; Xiaoyi Lu; Sheng Di; Zheng Zhang

TL;DR

SPARe (Stacked Parallelism with Adaptive Reordering) is a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. It achieves availability comparable to traditional replication while keeping computation overhead near-constant at only 2~3x.

Abstract

In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe - Stacked Parallelism with Adaptive Reordering - a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while maintaining near-constant computation overhead of only 2~3x, even under high redundancy where traditional replication would require linearly inflating overhead. We derive closed-form expressions for endurable failure count and computation overhead, validate them via SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize time-to-train. At extreme scale with up to 600k GPUs, SPARe reduces time-to-train by 40~50% compared to traditional replication.

Paper Structure (32 sections, 4 theorems, 57 equations, 9 figures, 6 tables, 2 algorithms)

Introduction
Background
Synchronous Data Parallelism
Node Failures in Large Parallel HPC System
Partial Recovery: Avoiding Global Restart
Traditional Replication: Robust yet Expensive
SPARe: Stacked Parallelism with Adaptive Reordering
Key Ideas of SPARe
Algorithm Flow of SPARe
Theoretical Analysis of SPARe
Theoretical Results about SPARe
Joint Optimization with Checkpointing
System Performance Evaluation
Realistic System Parameters
Performance Evaluation Results
...and 17 more sections

Key Result

Theorem 4.1

The average failure count $\mu(N,r)$ that SPARe can mask before the first wipe-out is asymptotically: where $\Gamma$ refers to the Gamma function in nistlib.

Figures (9)

Figure 1: Synchronous Data Parallelism.
Figure 2: Traditional Replication $r=3$.
Figure 3: (a) Example of SPARe at $N=9$, $r=3$. (b) Before any failure, all partial gradients can be collected after computing the $1^{\mathrm{st}}$ stack. (c) After group $1$ fails, the system must compute up to the $2^{\mathrm{nd}}$ stack to collect all types. (d) If group $2$ fails later, the type $2$ partial gradient cannot be collected within the $2^{\mathrm{nd}}$ stack of shards. (e) However, once the stack of group $8$ is reordered, all partial gradients can again be collected after computing up to the $2^{\mathrm{nd}}$ stack.
Figure 4: Average endurable failure count by redundancy $r$.
Figure 5: Average computation overhead by redundancy $r$.
...and 4 more figures
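
The scenario in the Figure 3 caption can be walked through with a small Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the placement offsets `(0, 1, 3)` are a guess chosen so that the toy reproduces the caption, and the helpers `initial_stacks`, `stacks_needed`, and `promote` are hypothetical names introduced here. The reordering step is applied by hand rather than chosen adaptively.

```python
from typing import Dict, List, Optional, Set

N = 9                # number of parallelism groups (and shard types), as in Figure 3
OFFSETS = (0, 1, 3)  # ASSUMPTION: illustrative placement offsets for r = 3

def initial_stacks(n: int = N, offsets=OFFSETS) -> Dict[int, List[int]]:
    """Group g (1-indexed) holds shard types g + o, wrapped into 1..n, in stack order."""
    return {g: [((g - 1 + o) % n) + 1 for o in offsets] for g in range(1, n + 1)}

def stacks_needed(stacks: Dict[int, List[int]], failed: Set[int], n: int = N) -> Optional[int]:
    """Smallest depth k at which the surviving groups' first k stacks cover
    every shard type, or None if some type is lost entirely (a wipe-out)."""
    depth = max(len(s) for s in stacks.values())
    for k in range(1, depth + 1):
        covered = {t for g, s in stacks.items() if g not in failed for t in s[:k]}
        if len(covered) == n:
            return k
    return None

def promote(stacks: Dict[int, List[int]], group: int, shard_type: int, slot: int) -> Dict[int, List[int]]:
    """Return a copy in which `group` computes `shard_type` at 0-based `slot`,
    swapping it with the shard that previously occupied that slot."""
    new = {g: list(s) for g, s in stacks.items()}
    s = new[group]
    i = s.index(shard_type)
    s[i], s[slot] = s[slot], s[i]
    return new

stacks = initial_stacks()
print(stacks_needed(stacks, set()))       # (b) no failures
print(stacks_needed(stacks, {1}))         # (c) group 1 failed
print(stacks_needed(stacks, {1, 2}))      # (d) groups 1 and 2 failed
reordered = promote(stacks, group=8, shard_type=2, slot=1)
print(stacks_needed(reordered, {1, 2}))   # (e) after reordering group 8's stack
```

Under these assumptions, group 8 initially holds types `[8, 9, 2]`. Moving type 2 into its second slot restores full coverage at depth two, because type 9 is still produced by group 9's first stack.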

Theorems & Definitions (12)

Theorem 4.1: Average Failure Count
proof
Theorem 4.2: Average Computation Overhead
proof
Theorem 4.3: Optimal $r^\star$ for minimal time-to-train
proof
Definition 2.1: Cyclic Golomb Ruler distribution rule
Lemma 2.2
proof
proof
...and 2 more
