Understanding Silent Data Corruption in LLM Training

Jeffrey Ma, Hengzhi Pei, Leonard Lausen, George Karypis

TL;DR

This work investigates real-world silent data corruption (SDC) during large language model (LLM) training by pairing unhealthy nodes with healthy ones under deterministic execution, and it develops two synchronization mechanisms to isolate SDC effects. It analyzes SDC impact at three levels of granularity (submodule outputs, a single optimizer step, and the full training period), revealing that SDC behavior varies across nodes, that SDCs can push models to different optima, and that they occasionally cause training-loss spikes that fully corrupt models. Although the average perturbations to submodule computations and gradients are small, SDCs can significantly alter the learned parameters and the training trajectory, underscoring the need for robust detection and mitigation strategies. The findings highlight a complex relationship between SDCs and the loss landscape and offer concrete directions for improving SDC resilience in future LLM training. Practical implications include the potential use of recomputation or shadow replicas to detect SDCs and a focus on loss-surface characteristics to mitigate their impact.

Abstract

As the scale of training large language models (LLMs) increases, one emergent failure is silent data corruption (SDC), where hardware produces incorrect computations without explicit failure signals. In this work, we are the first to investigate the impact of real-world SDCs on LLM training by comparing model training between healthy production nodes and unhealthy nodes exhibiting SDCs. With help from a cloud computing platform, we access unhealthy nodes that were swept out of production by automated fleet management. Using deterministic execution via the XLA compiler and our proposed synchronization mechanisms, we isolate and analyze the impact of SDC errors on these nodes at three levels: at each submodule computation, at a single optimizer step, and over a training period. Our results reveal that the impact of SDCs on computation varies across unhealthy nodes. Although in most cases the perturbations from SDCs on submodule computation and gradients are relatively small, SDCs can lead models to converge to different optima with different weights and even cause spikes in the training loss. Our analysis sheds light on further understanding and mitigating the impact of SDCs.
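The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mismatch_frequency` and `noise_to_signal` are hypothetical, and both metric definitions are assumptions (the fraction of output elements that differ, and the $L_2$-norm of the gradient difference relative to the $L_2$-norm of the healthy-node gradient). The idea is that under deterministic execution, identical inputs should yield bit-identical outputs, so any difference between a healthy and an unhealthy node can be attributed to SDC.

```python
import math


def mismatch_frequency(healthy, unhealthy):
    """Fraction of elements that differ between two equal-length outputs.

    Assumed definition, for illustration only. Exact equality is used on
    purpose: with deterministic execution, any difference is suspicious.
    Note that a NaN compares unequal to everything and counts as a mismatch.
    """
    if len(healthy) != len(unhealthy):
        raise ValueError("outputs must have the same length")
    if not healthy:
        return 0.0
    mismatches = sum(1 for h, u in zip(healthy, unhealthy) if h != u)
    return mismatches / len(healthy)


def noise_to_signal(healthy_grad, unhealthy_grad):
    """||g_unhealthy - g_healthy||_2 / ||g_healthy||_2 (assumed definition)."""
    if len(healthy_grad) != len(unhealthy_grad):
        raise ValueError("gradients must have the same length")
    diff = math.sqrt(sum((u - h) ** 2 for h, u in zip(healthy_grad, unhealthy_grad)))
    ref = math.sqrt(sum(h * h for h in healthy_grad))
    if ref == 0.0:
        # No reference signal: report 0 only if there is also no noise.
        return 0.0 if diff == 0.0 else math.inf
    return diff / ref


# Toy values standing in for a submodule output and a flattened gradient.
healthy_out = [1.0, 2.0, 3.0, 4.0]
unhealthy_out = [1.0, 2.0, 3.5, 4.0]  # one corrupted element
freq = mismatch_frequency(healthy_out, unhealthy_out)

healthy_grad = [3.0, 4.0]
unhealthy_grad = [3.0, 4.5]
ratio = noise_to_signal(healthy_grad, unhealthy_grad)
```

In this toy example one of four output elements differs, and the gradient perturbation is small relative to the gradient norm. That mirrors the abstract's observation that perturbations are usually small even when their downstream effect on training is not.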

Paper Structure

This paper contains 42 sections, 8 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Illustration of fleet management flow, where nodes are vetted through several rounds of testing at different granularities.
  • Figure 2: Illustration of how our lock-step parallelism works in a Transformer decoder layer. The arrows indicate intermediate tensors corrected by communicating values from the healthy to the unhealthy node (red in the forward pass, blue in the backward pass). In the forward pass, $g$ is an all-gather and $\bar{g}$ is a reduce-scatter, while in the backward pass $g$ is a reduce-scatter and $\bar{g}$ is an all-gather.
  • Figure 3: Non-uniform spikes of mismatch frequency in the forward computation of the attention module over time on Nodes 7 and 14.
  • Figure 4: High SDC occurrence with large initial spikes in smoothed mismatch frequency for the forward computation of the attention module on Nodes 10 and 11.
  • Figure 5: $L_2$-norm of the gradient difference and the ground-truth gradients over steps. The left table shows Worst Case Noise-to-Signal (WCNTS) ratios for unhealthy nodes.
  • ...and 6 more figures