Exact Attention Sensitivity and the Geometry of Transformer Stability
Seyed Morteza Emadi
TL;DR
This work develops a geometry-aligned theory of transformer stability that explains why pre-LayerNorm stabilizes training, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary. It introduces a block-$\infty$/RMS norm that yields Lipschitz bounds independent of sequence length, and derives the exact softmax Jacobian norm $\|J_{\mathrm{softmax}}(u/\tau)\|_{\infty\to 1}=\theta(p)/\tau$, where the balanced-mass factor $\theta(p)\in[0,1]$ governs sensitivity. The analysis shows that pre-LN provides an additive identity gradient path, whereas post-LN forces gradients through LayerNorm Jacobians, causing depth-dependent contraction; the quartic structure of attention yields the $N^{-1/4}$ scaling used in DeepNorm. Crucially, the empirical findings show that $\theta(p) \approx 1$ persists during training, indicating that stability is architectural rather than driven by attention sharpening. The paper offers a practical design rule: scale each multiplicative map in the dominant sensitivity pathway by $N^{-1/m}$, where $m$ is the number of maps. Together with warmup analogs via the temperature $\tau$, this rule guides robust deep-transformer design beyond conventional heuristics.
Abstract
Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the \emph{exact} operator norm of the softmax Jacobian, $\|J_{\mathrm{softmax}}(u/\tau)\|_{\infty\to 1} = \theta(p)/\tau$, where the balanced-mass factor $\theta(p)\in[0,1]$ quantifies attention sensitivity. (2) We introduce a block-$\infty$/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's $N^{-1/4}$ emerges from the quartic structure of attention's four projection matrices. We validate our theory on 774M-parameter models and find that, contrary to the intuition that attention sharpens during training to reduce sensitivity, $\theta(p) \approx 1$ persists throughout. Transformer stability arises entirely from architectural gradient flow, not from attention dynamics. This finding changes how we reason about training: the architecture itself must handle sensitivity, not learned attention patterns.
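As a numerical sanity check on the identity $\|J_{\mathrm{softmax}}(u/\tau)\|_{\infty\to 1} = \theta(p)/\tau$, the sketch below brute-forces both sides for small inputs. It assumes the closed form $\theta(p) = 4\max_{S} p(S)\,(1-p(S))$ with $p(S)=\sum_{i\in S} p_i$, which is what a direct calculation over sign vectors gives for the Jacobian $(\mathrm{diag}(p)-pp^\top)/\tau$. The abstract does not spell out the definition, so this form and the helper names `inf_to_one_norm` and `theta` should be read as illustrative assumptions, not as the paper's code.

```python
import itertools

import numpy as np


def softmax(u, tau=1.0):
    """Temperature-scaled softmax p = softmax(u / tau), computed stably."""
    z = (u - np.max(u)) / tau
    e = np.exp(z)
    return e / e.sum()


def softmax_jacobian(u, tau=1.0):
    """Jacobian of softmax(u / tau) with respect to u: (diag(p) - p p^T) / tau."""
    p = softmax(u, tau)
    return (np.diag(p) - np.outer(p, p)) / tau


def inf_to_one_norm(J):
    """Brute-force max of ||J x||_1 over ||x||_inf <= 1.

    The maximum of a convex function over the cube is attained at a vertex,
    so it suffices to enumerate sign vectors (exponential in n).
    """
    n = J.shape[1]
    best = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        best = max(best, float(np.abs(J @ np.array(signs)).sum()))
    return best


def theta(p):
    """Assumed balanced-mass factor: 4 * max over subsets S of p(S) * (1 - p(S))."""
    best = 0.0
    for mask in itertools.product((0.0, 1.0), repeat=len(p)):
        mass = float(np.dot(p, mask))
        best = max(best, 4.0 * mass * (1.0 - mass))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.normal(size=8)  # hypothetical attention logits for 8 tokens
    for tau in (0.5, 1.0, 2.0):
        lhs = inf_to_one_norm(softmax_jacobian(u, tau))
        rhs = theta(softmax(u, tau)) / tau
        print(f"tau={tau}: norm={lhs:.6f}, theta/tau={rhs:.6f}")
```

Both enumerations are exponential in the number of tokens, so the check is only practical for small inputs. Under the assumed form, uniform attention over an even number of tokens gives $\theta(p)=1$ (half the tokens carry mass exactly $1/2$), while near one-hot attention drives $\theta(p)$ toward $0$.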
