What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec, Felix Dangel, Sidak Pal Singh
TL;DR
This work tackles why Transformer optimization behaves differently from MLPs/CNNs by deriving the exact Hessian of a single self-attention layer and decomposing it into outer-product and functional components ${\bf H} = {\bf H}_{\text{o}} + {\bf H}_{\text{f}}$. It reveals that the Hessian exhibits highly nonlinear, block-heterogeneous dependencies on data, weight matrices, and attention moments ${\bf M}_1, {\bf M}_2, {\bf M}_3$, with detailed block structures across ${\bf W}_{\rm Q}$, ${\bf W}_{\rm K}$, and ${\bf W}_{\rm V}$. The study further demonstrates how Transformer design choices, notably softmax activation and the two-matrix query-key parameterization, induce nonlinearity and indefiniteness in the Hessian, while alternatives like linear attention or Pre-LN can dampen data-driven heterogeneity. These insights explain empirical observations about the need for adaptive optimizers and normalization and provide a principled foundation for Hessian-aware optimization and architecture design in Transformers.
Abstract
The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning, to the extent that, in comparison to MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, etc. The root causes behind these outward manifestations and the precise mechanisms that govern them remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from other architectures, grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first fully derive the Transformer's Hessian and express it in terms of matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) in doing so, we further highlight the important structural differences from the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.
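
To make the split of the Hessian into an outer-product part and a functional part concrete, the following is a minimal numerical sketch, not the derivation used in this work. It assumes a single-head self-attention layer F(X) = softmax(X W_Q W_K^T X^T / sqrt(d_K)) X W_V, a mean-squared-error loss averaged over all outputs, tiny arbitrary dimensions, random data, and finite differences in place of the closed-form matrix derivatives. The helper names (`attention`, `jacobian_fd`, `hessian_fd`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def softmax_rows(Z):
    """Numerically stable row-wise softmax."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def attention(theta, X, d_k, d_v):
    """Flattened output of softmax(X W_Q W_K^T X^T / sqrt(d_k)) X W_V."""
    d = X.shape[1]
    n_qk = d * d_k
    W_Q = theta[:n_qk].reshape(d, d_k)
    W_K = theta[n_qk:2 * n_qk].reshape(d, d_k)
    W_V = theta[2 * n_qk:].reshape(d, d_v)
    A = softmax_rows(X @ W_Q @ W_K.T @ X.T / np.sqrt(d_k))
    return (A @ X @ W_V).ravel()

def jacobian_fd(f, theta, h=1e-3):
    """Central-difference Jacobian of f at theta (illustrative, not efficient)."""
    eye = np.eye(theta.size)
    cols = [(f(theta + h * e) - f(theta - h * e)) / (2 * h) for e in eye]
    return np.stack(cols, axis=1)

def hessian_fd(loss, theta, h=1e-3):
    """Central-difference Hessian of a scalar loss at theta."""
    p = theta.size
    eye = np.eye(p)
    H = np.empty((p, p))
    for i in range(p):
        for j in range(i, p):
            ei, ej = h * eye[i], h * eye[j]
            H[i, j] = H[j, i] = (
                loss(theta + ei + ej) - loss(theta + ei - ej)
                - loss(theta - ei + ej) + loss(theta - ei - ej)
            ) / (4 * h * h)
    return H

# Toy problem: sizes, seed, weight scale, and random targets are arbitrary assumptions.
rng = np.random.default_rng(0)
L, d, d_k, d_v = 3, 2, 2, 2
X = rng.normal(size=(L, d))
Y = rng.normal(size=(L, d_v))
theta = 0.5 * rng.normal(size=2 * d * d_k + d * d_v)  # [W_Q, W_K, W_V] flattened

f = lambda t: attention(t, X, d_k, d_v)
loss = lambda t: np.mean((f(t) - Y.ravel()) ** 2)  # MSE over all L * d_v outputs

H = hessian_fd(loss, theta)
J = jacobian_fd(f, theta)
H_o = (2.0 / (L * d_v)) * J.T @ J  # outer-product part: J^T (d^2 loss / dF^2) J
H_f = H - H_o                      # functional part: sum_i (dloss/dF_i) * Hessian of F_i

eig_o = np.linalg.eigvalsh(H_o)
eig_f = np.linalg.eigvalsh(H_f)
vv = slice(2 * d * d_k, None)      # indices of the W_V parameters
print("min eigenvalue of H_o:", eig_o.min())
print("eigenvalue range of H_f:", eig_f.min(), eig_f.max())
print("max |H_f| in the value-value block:", np.abs(H_f[vv, vv]).max())
```

Under these assumptions, the outer-product part is positive semi-definite by construction, while the functional part is where indefiniteness can enter. Its value-value block vanishes up to numerical error because the layer output is linear in W_V, which is one simple instance of the block heterogeneity discussed above.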
