Numerical Error Analysis of Large Language Models
Stanislav Budzinskiy, Wenyi Fang, Longbin Zeng, Philipp Petersen
TL;DR
This work provides a rigorous forward error analysis of the forward pass of decoder-style transformers in finite-precision arithmetic. It derives Jacobian-based condition numbers for layer normalization, two-layer perceptrons, and self-attention, and combines them into a depth-dependent bound showing that relative errors can grow exponentially with the number of blocks, with the growth rate tied to the conditioning of $W^{\top}W_q$ and related norms. The authors also present a deterministic rounding-error framework that yields explicit per-layer error terms and a main theorem bounding the output error of a deep transformer. Numerical experiments corroborate the bounds, reveal discrepancies between mean and median errors, and suggest practical mitigation strategies such as computing attention components in higher precision. The results yield concrete guidelines for stable transformer inference, including precision management and conditioning controls, with potential extensions to mixed-precision schemes and to backward-pass and training analysis. Overall, the paper quantifies how finite-precision effects propagate through transformer forward passes and how to design more numerically stable LLM inference pipelines.
Abstract
Large language models based on transformer architectures have become integral to state-of-the-art natural language processing applications. However, their training remains computationally expensive and exhibits instabilities, some of which are expected to be caused by finite-precision computations. We provide a theoretical analysis of the impact of round-off errors within the forward pass of a transformer architecture which yields fundamental bounds for these effects. In addition, we conduct a series of numerical experiments which demonstrate the practical relevance of our bounds. Our results yield concrete guidelines for choosing hyperparameters that mitigate round-off errors, leading to more robust and stable inference.
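To make the notion of round-off error in a forward pass concrete, the following is a minimal NumPy sketch that evaluates one single-head softmax self-attention layer in half, single, and double precision and reports the relative error against the double-precision result. It is an illustration only and not the authors' experimental setup: the function `attention`, the random weights, the sizes `n_tokens` and `d_model`, and the omission of the causal mask are all assumptions made here for brevity.

```python
import numpy as np


def attention(x, w_q, w_k, w_v, dtype):
    """Single-head softmax self-attention evaluated entirely in `dtype`.

    Hypothetical helper for illustration; no causal mask is applied.
    """
    # Round inputs and weights to the working precision first.
    x, w_q, w_k, w_v = (a.astype(dtype) for a in (x, w_q, w_k, w_v))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / dtype(np.sqrt(q.shape[-1]))
    # Subtract the row maximum before exponentiating to avoid overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def relative_error(y, y_ref):
    """Normwise relative error of y with respect to the reference y_ref."""
    return np.linalg.norm(y.astype(np.float64) - y_ref) / np.linalg.norm(y_ref)


rng = np.random.default_rng(0)
n_tokens, d_model = 8, 16  # assumed toy sizes
x = rng.standard_normal((n_tokens, d_model))
# Scale weights so that queries, keys, and values have roughly unit variance.
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

y_ref = attention(x, w_q, w_k, w_v, np.float64)
err16 = relative_error(attention(x, w_q, w_k, w_v, np.float16), y_ref)
err32 = relative_error(attention(x, w_q, w_k, w_v, np.float32), y_ref)

print(f"float16 relative error: {err16:.2e}")
print(f"float32 relative error: {err32:.2e}")
```

The measured errors include the rounding of the inputs and weights to the working precision, so they reflect both representation error and accumulated arithmetic error. Stacking several such layers and tracking the error per layer would be one way to observe depth-dependent growth empirically, under the same assumptions.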
