Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
Seyed Morteza Emadi
TL;DR
The paper introduces a rank-aware, geometry-driven calibration framework to stabilize FP8 training for transformers by bounding attention logits with a per-layer spectral-norm-based scale. It derives a tighter, rank-aware probabilistic guarantee for overflow risk, enabling principled selection of a calibration factor $\alpha$ and a per-layer scale $\text{scale}^{(\ell)}$ that adapts to weight geometry. An efficient, implicit power-iteration procedure estimates the interaction matrix's spectral norm without forming large matrices, supporting grouped query attention and RoPE extensions, and a memory-safe auto-$\alpha$ scheme tunes utilization during steady-state training. Empirical results across GPT-2 XL to Llama-2-70B demonstrate zero overflows in transient scenarios while maintaining comparable MMLU accuracy, with modest overhead and improved FP8 dynamic range usage. The approach offers a practical, theoretically grounded path to reliable low-precision transformer training that remains compatible with fused attention kernels.
Abstract
Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$--$28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while achieving comparable downstream MMLU accuracy.
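
To illustrate what estimating $\|W^Q W^{K\top}\|_2$ by implicit power iteration can look like, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a single head with $W^Q, W^K \in \mathbb{R}^{d \times d_h}$, ignores RoPE and grouped query attention, and uses a hypothetical function name and a fixed iteration count. The $d \times d$ matrix $M$ is never formed; each step applies $M$ and $M^\top$ as two thin matrix-vector products.

```python
import numpy as np


def spectral_norm_implicit(W_q, W_k, num_iters=100, seed=0):
    """Estimate ||W_q @ W_k.T||_2 without materializing the d x d product.

    Hypothetical helper for illustration. Assumes W_q and W_k both have
    shape (d, d_h). Runs power iteration on M^T M, where M = W_q @ W_k.T.
    """
    rng = np.random.default_rng(seed)
    d = W_k.shape[0]

    # Random unit start vector in the input space of M.
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)

    for _ in range(num_iters):
        u = W_q @ (W_k.T @ v)   # u = M v, costs O(d * d_h)
        w = W_k @ (W_q.T @ u)   # w = M^T u, costs O(d * d_h)
        norm_w = np.linalg.norm(w)
        if norm_w == 0.0:
            # M v vanished: M is zero or v lies in its null space.
            return 0.0
        v = w / norm_w

    # For a unit vector v, ||M v|| is a lower bound on the spectral norm
    # and approaches it as v aligns with the top right singular vector.
    return float(np.linalg.norm(W_q @ (W_k.T @ v)))
```

Because the returned value is $\|Mv\|$ for a unit vector $v$, it can never exceed the true spectral norm, so a scale derived from it may need a safety margin when the iteration count is small. How the estimate is mapped to a per-layer scale is defined by the paper and is not reproduced here.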
