Understanding Silent Data Corruption in LLM Training
Jeffrey Ma, Hengzhi Pei, Leonard Lausen, George Karypis
TL;DR
This work investigates real-world silent data corruption (SDC) during large language model (LLM) training by pairing unhealthy nodes with healthy ones under deterministic execution, and develops two synchronization mechanisms to isolate SDC effects. It analyzes SDC impact at three levels of granularity: submodule outputs, a single optimizer step, and the full training period. The analysis shows that SDC behavior varies across nodes, that SDCs can push models to different optima, and that they occasionally cause training-loss spikes that fully corrupt the model. Although the average perturbations to submodule computations and gradients are small, SDCs can significantly alter the learned parameters and the training trajectory, underscoring the need for robust detection and mitigation strategies. The findings point to a complex relationship between SDCs and the loss landscape and suggest concrete directions for improving SDC resilience in future LLM training. Practical implications include the potential use of recomputation or shadow replicas to detect SDCs, and attention to loss-surface characteristics to mitigate their impact.
Abstract
As the scale of training large language models (LLMs) increases, one emergent failure is silent data corruption (SDC), where hardware produces incorrect computations without explicit failure signals. In this work, we are the first to investigate the impact of real-world SDCs on LLM training by comparing model training between healthy production nodes and unhealthy nodes exhibiting SDCs. With help from a cloud computing platform, we access unhealthy nodes that were swept out of production by automated fleet management. Using deterministic execution via the XLA compiler and our proposed synchronization mechanisms, we isolate and analyze the impact of SDC errors on these nodes at three levels: at each submodule computation, at a single optimizer step, and over a training period. Our results reveal that the impact of SDCs on computation varies across unhealthy nodes. Although in most cases the perturbations that SDCs introduce into submodule computations and gradients are relatively small, SDCs can lead models to converge to different optima with different weights and can even cause spikes in the training loss. Our analysis sheds light on how to further understand and mitigate the impact of SDCs.
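The comparison idea at the submodule level can be illustrated with a small sketch. This is not the authors' tooling: it assumes a toy linear layer as the "submodule", stands in for an unhealthy node by flipping one bit in one output value (a simulated fault, whereas the paper studies faults produced by real hardware), and uses hypothetical helper names (`flip_bit`, `linear`, `compare_outputs`). Because the computation is deterministic, any element-wise difference from the healthy reference can be attributed to the injected corruption.

```python
import random
import struct


def flip_bit(x: float, bit: int) -> float:
    """Flip one bit in the float32 encoding of x (simulated fault, an assumption)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", x))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def linear(x, w):
    """Toy submodule: y = x @ w with a fixed summation order, so reruns match bitwise."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def compare_outputs(ref, test):
    """Return (fraction of mismatching elements, worst relative difference)."""
    mismatched = [(r, t) for r, t in zip(ref, test) if r != t]
    freq = len(mismatched) / len(ref)
    worst = max((abs(r - t) / max(abs(r), 1e-12) for r, t in mismatched), default=0.0)
    return freq, worst


rng = random.Random(0)  # fixed seed: identical inputs on both "nodes"
x = [rng.uniform(-1, 1) for _ in range(8)]
w = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

healthy = linear(x, w)   # reference output from the healthy node
rerun = linear(x, w)     # a second healthy run reproduces it exactly
unhealthy = list(healthy)
unhealthy[2] = flip_bit(unhealthy[2], bit=20)  # hypothetical corrupted element

clean_freq, clean_worst = compare_outputs(healthy, rerun)
sdc_freq, sdc_worst = compare_outputs(healthy, unhealthy)
print(f"healthy rerun: mismatch frequency={clean_freq}, worst relative diff={clean_worst}")
print(f"simulated SDC: mismatch frequency={sdc_freq}, worst relative diff={sdc_worst:.4f}")
```

The healthy rerun shows no mismatches, which is what makes the bitwise comparison meaningful; without deterministic execution, ordinary floating-point nondeterminism would be indistinguishable from small corruptions.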
