Exact Attention Sensitivity and the Geometry of Transformer Stability

Seyed Morteza Emadi

TL;DR

This work develops a geometry-aligned theory of transformer stability that explains why pre-LayerNorm stabilizes training, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary. It introduces a block-$\infty$/RMS norm that yields Lipschitz bounds independent of sequence length, and it derives the exact softmax Jacobian norm $\|J_{\mathrm{softmax}}\|_{\infty\to 1}=\theta(p)/\tau$, where the balanced-mass factor $\theta(p)\in[0,1]$ governs sensitivity. The analysis shows that pre-LN provides an additive identity gradient path, whereas post-LN forces gradients through LayerNorm Jacobians, causing depth-dependent contraction; the quartic structure of attention yields the $N^{-1/4}$ scaling used in DeepNorm. Empirically, $\theta(p) \approx 1$ persists throughout training, indicating that stability is architectural rather than driven by attention sharpening. The paper offers a practical design rule (scale each multiplicative map in the dominant sensitivity pathway by $N^{-1/m}$, where $m$ is the number of maps) together with warmup analogs via the temperature $\tau$, guiding the design of deep transformers beyond conventional heuristics.
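The design rule can be read as exponent bookkeeping. The sketch below assumes the dominant pathway is a product of $m$ maps, written here with the hypothetical names $W_1,\dots,W_m$, each scaled by the same factor; identifying $m=4$ with attention's four projection matrices follows the abstract.

```latex
\prod_{k=1}^{m} \bigl\| N^{-1/m} W_k \bigr\|
  \;=\; N^{-1} \prod_{k=1}^{m} \| W_k \|,
\qquad
m = 4 \;\Longrightarrow\; \text{per-map scale } N^{-1/4}.
```

Under this reading, the product along the pathway shrinks like $1/N$, which is the rate needed for $N$ such contributions to stay bounded if they add across layers. That additivity is an assumption of this sketch, not a statement taken from the paper.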

Abstract

Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the exact operator norm of the softmax Jacobian, $\|J_{\mathrm{softmax}}(u/\tau)\|_{\infty\to 1} = \theta(p)/\tau$, where the balanced-mass factor $\theta(p)\in[0,1]$ quantifies attention sensitivity. (2) We introduce a block-$\infty$/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's $N^{-1/4}$ emerges from the quartic structure of attention's four projection matrices. We validate our theory on 774M-parameter models and find that, contrary to the intuition that attention sharpens during training to reduce sensitivity, $\theta(p) \approx 1$ persists throughout. Transformer stability arises entirely from architectural gradient flow, not from attention dynamics. This finding changes how we reason about training: the architecture itself must handle sensitivity, not learned attention patterns.
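The identity in pillar (1) can be checked numerically for small inputs. The sketch below is not the authors' code: it assumes the balanced-mass factor has the form $\theta(p) = \max_S 4\,p(S)\,(1-p(S))$ over index subsets $S$ (the paper's own definition is not reproduced in this excerpt), and it computes the $\infty\to 1$ norm of $J = (\mathrm{diag}(p) - pp^\top)/\tau$ by brute force over sign vectors. The function names are illustrative.

```python
import itertools
import numpy as np


def softmax(u, tau=1.0):
    """Softmax of u / tau, shifted by the max for numerical stability."""
    z = (np.asarray(u, dtype=float) - np.max(u)) / tau
    e = np.exp(z)
    return e / e.sum()


def jacobian_inf_to_1_norm(p, tau=1.0):
    """Brute-force ||J||_{inf->1} for J = (diag(p) - p p^T) / tau.

    ||J x||_1 is convex in x, so its maximum over the unit inf-ball is
    attained at a sign vector; all 2^n sign vectors are enumerated.
    """
    p = np.asarray(p, dtype=float)
    J = (np.diag(p) - np.outer(p, p)) / tau
    return max(
        np.abs(J @ np.array(signs)).sum()
        for signs in itertools.product((-1.0, 1.0), repeat=len(p))
    )


def theta(p):
    """ASSUMED form of the balanced-mass factor: max_S 4 p(S) (1 - p(S))."""
    p = np.asarray(p, dtype=float)
    best = 0.0
    for mask in itertools.product((False, True), repeat=len(p)):
        mass = p[np.array(mask)].sum()  # total probability placed on subset S
        best = max(best, 4.0 * mass * (1.0 - mass))
    return best


rng = np.random.default_rng(0)
tau = 0.7
p = softmax(rng.normal(size=6), tau)
# The two printed numbers should agree up to floating-point error.
print(jacobian_inf_to_1_norm(p, tau), theta(p) / tau)
```

Both enumerations cost $2^n$ evaluations, so this check is only practical for short rows. Under the assumed form, $\theta(p)$ reaches 1 whenever some subset carries exactly half the probability mass, and drops to 0 for a one-hot distribution.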

Paper Structure (80 sections, 12 theorems, 75 equations, 6 figures, 1 table)

Introduction
Limitations of existing theory.
Our approach: architecture-aligned geometry.
Contributions.
Related Work
Geometric Framework
Notation and stability concepts.
Why Standard Geometry Fails
Block-$\infty$/RMS Geometry
Intuition.
Exact Softmax Sensitivity
The stability-expressivity trade-off.
Implications for training dynamics.
Layerwise Stability Analysis
Multi-Head Attention Lipschitz Bound
...and 65 more sections

Key Result

Lemma 3.2

For any row-stochastic $A \in \mathbb{R}^{L \times L}$ and $V \in \mathbb{R}^{L \times d}$: $\|AV\|_{\infty,\mathrm{rms}} \le \|V\|_{\infty,\mathrm{rms}}$.
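A quick numerical illustration of the lemma follows. It assumes the block-$\infty$/RMS norm is the maximum over tokens (rows) of the per-token RMS, $\max_i \|x_i\|_2/\sqrt{d}$; Definition 3.1 is not reproduced in this excerpt, so that reading is an assumption. Under it, each row of $AV$ is a convex combination of rows of $V$, and convexity of the RMS gives the bound.

```python
import numpy as np


def block_inf_rms(X):
    """ASSUMED block-inf/RMS norm: max over rows of the row-wise RMS."""
    return np.sqrt(np.mean(X**2, axis=-1)).max()


rng = np.random.default_rng(0)
L, d = 32, 16

# Row-stochastic A built as a row-wise softmax of random scores.
scores = rng.normal(size=(L, L))
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

V = rng.normal(size=(L, d))

lhs = block_inf_rms(A @ V)  # norm after attention mixing
rhs = block_inf_rms(V)      # norm of the values
print(lhs, rhs, lhs <= rhs)
```

Neither side depends on $L$ through a normalization factor, which is consistent with the length-independent bounds described in the abstract.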

Figures (6)

Figure 1: Training dynamics for 774M-parameter transformers reveal the stability mechanism. Top-left: Pre-LN converges stably (loss 51.5); Post-LN plateaus unstably (60.6). Top-right: Post-LN gradient spike at step 2400 triggers clipping. Bottom-left: Key finding: $\theta(p)/\tau = 1.0$ throughout for both; stability is not from attention sharpening. Bottom-right: Post-LN gradient pathology validates Theorem \ref{thm:postln-gradient}: vanishing gradients in layers 1--30, spike at output.
Figure 2: Projection norm product $G_\ell$ over training. Both architectures show $G_\ell$ growing by $\sim 3\times$ (log scale), with the steepest growth immediately following warmup (dashed line at step 500). This confirms warmup protects training during the initial high-drift phase when projection norms are most volatile.
Figure 3: Verification of Theorem \ref{thm:softmax}. For $L \le 16$, exhaustive enumeration confirms that the identity $\|J_{\mathrm{softmax}}\|_{\infty \to 1} = \theta(p)/\tau$ holds to machine precision ($< 10^{-13}$ relative error). For larger $L$, greedy approximation of $\theta(p)$ yields consistent results across 500 random distributions per length.
Figure 4: $\theta(p)$ measured at layers 0, 18, and 35 (first, middle, last) at the end of training. Both architectures show $\theta \approx 1.0$ at all sampled layers.
Figure 5: Sensitivity proxy $S_\ell = (\theta/\tau) \cdot \bar{B}_\ell^2 \cdot G_\ell$ over training. Top: $S_\ell$ values (log scale) showing Pre-LN achieves higher absolute sensitivity but remains stable. Bottom: Coefficient of variation across layers, showing Pre-LN has higher layerwise dispersion.
...and 1 more figure

Theorems & Definitions (28)

Definition 3.1: Block-$\infty$/RMS norm
Lemma 3.2: Attention mixing is nonexpansive
Lemma 3.3: LayerNorm magnitude reset
Definition 4.1: Balanced-mass factor
Theorem 4.2: Exact softmax Jacobian norm
Corollary 4.3: Sensitivity regimes
Theorem 5.1: MHA Lipschitz bound
Theorem 5.2: Pre-LN: full layer bound
Theorem 5.3: Post-LN: full layer bound
Theorem 5.4: Pre-LN: identity gradient path
...and 18 more
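The contrast between the pre-LN identity path and the post-LN path through LayerNorm Jacobians can be seen on a toy layer. This is a minimal sketch and not the paper's experiment: it assumes LayerNorm without affine parameters or epsilon, and a hypothetical small linear sublayer `W` standing in for attention or the MLP.

```python
import numpy as np


def layernorm(x):
    """LayerNorm without learned affine parameters or epsilon."""
    xc = x - x.mean()
    return xc / np.sqrt(np.mean(xc**2))


def layernorm_jacobian(x):
    """Analytic Jacobian: (I - 11^T/d - xhat xhat^T/d) / sigma."""
    d = x.size
    xc = x - x.mean()
    sigma = np.sqrt(np.mean(xc**2))
    xhat = xc / sigma
    return (np.eye(d) - np.ones((d, d)) / d - np.outer(xhat, xhat) / d) / sigma


rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)
W = 0.01 * rng.normal(size=(d, d)) / np.sqrt(d)  # hypothetical small sublayer

# Pre-LN layer:  y = x + W LN(x)   ->  dy/dx = I + W J_LN(x)
J_pre = np.eye(d) + W @ layernorm_jacobian(x)
# Post-LN layer: y = LN(x + W x)   ->  dy/dx = J_LN(x + W x) (I + W)
J_post = layernorm_jacobian(x + W @ x) @ (np.eye(d) + W)

sv_pre = np.linalg.svd(J_pre, compute_uv=False)
sv_post = np.linalg.svd(J_post, compute_uv=False)
print("pre-LN  min singular value:", sv_pre.min())
print("post-LN min singular value:", sv_post.min())
```

In this toy setting the pre-LN Jacobian is a small perturbation of the identity, so every gradient direction passes through. The post-LN Jacobian is left-multiplied by a LayerNorm Jacobian, which annihilates the all-ones direction and the normalized input direction, so it is singular at every layer.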
