What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

Weronika Ormaniec, Felix Dangel, Sidak Pal Singh

TL;DR

This work tackles why Transformer optimization behaves differently from MLPs/CNNs by deriving the exact Hessian of a single self-attention layer and decomposing it into outer-product and functional components, $\mathbf{H} = \mathbf{H}_{\text{o}} + \mathbf{H}_{\text{f}}$. It reveals that the Hessian exhibits highly nonlinear, block-heterogeneous dependencies on data, weight matrices, and attention moments $\mathbf{M}_1, \mathbf{M}_2, \mathbf{M}_3$, with detailed block structures across $\mathbf{W}_{\rm Q}$, $\mathbf{W}_{\rm K}$, and $\mathbf{W}_{\rm V}$. The study further demonstrates how Transformer design choices, notably softmax activation and the two-matrix query-key parameterization, induce nonlinearity and indefiniteness in the Hessian, while alternatives like linear attention or Pre-LN can dampen data-driven heterogeneity. These insights explain empirical observations about the need for adaptive optimizers and normalization and provide a principled foundation for Hessian-aware optimization and architecture design in Transformers.
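
The split into outer-product and functional parts can be read as the standard chain-rule (Gauss-Newton) decomposition of a loss Hessian. The sketch below uses generic symbols that are assumptions for illustration and not necessarily the paper's notation: $\theta$ for the vectorized parameters, $F(\theta)$ for the vectorized network output, and $\ell$ for the loss evaluated on that output.

```latex
% Assumed generic notation: \mathcal{L}(\theta) = \ell(F(\theta)).
\frac{\partial^2 \mathcal{L}}{\partial \theta \, \partial \theta^\top}
  = \underbrace{
      \left( \frac{\partial F}{\partial \theta} \right)^{\!\top}
      \frac{\partial^2 \ell}{\partial F \, \partial F^\top}
      \left( \frac{\partial F}{\partial \theta} \right)
    }_{\mathbf{H}_{\text{o}} \ \text{(outer-product part)}}
  \; + \;
    \underbrace{
      \sum_{i} \frac{\partial \ell}{\partial F_i} \,
      \frac{\partial^2 F_i}{\partial \theta \, \partial \theta^\top}
    }_{\mathbf{H}_{\text{f}} \ \text{(functional part)}}
```

When $\ell$ is convex in $F$, the first term is positive semi-definite, so any indefiniteness of the full Hessian has to enter through the second term, which carries the second derivatives of the network itself.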

Abstract

The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning--to the extent that, in comparison to MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, etc. The root causes behind these outward manifestations and the precise mechanisms that govern them remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures--grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer's Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so further highlight the important structural differences to the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.
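
To make the object of study concrete, the following is a minimal numerical sketch, not the paper's derivation: it builds a tiny single self-attention layer with a square loss and approximates its Hessian by central finite differences instead of the analytic matrix derivatives. The layer form $\mathrm{softmax}(\mathbf{X}\mathbf{W}_{\rm Q}\mathbf{W}_{\rm K}^\top\mathbf{X}^\top/\sqrt{d_K})\,\mathbf{X}\mathbf{W}_{\rm V}$, the $1/\sqrt{d_K}$ scaling, the weight shapes, the loss normalization, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax_rows(S):
    # Row-wise softmax with the usual max-shift for numerical stability.
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def unpack(theta, d_v, d_k):
    # Assumed parameter layout: [W_Q, W_K, W_V], each flattened row-wise.
    n_qk = d_v * d_k
    W_Q = theta[:n_qk].reshape(d_v, d_k)
    W_K = theta[n_qk:2 * n_qk].reshape(d_v, d_k)
    W_V = theta[2 * n_qk:].reshape(d_v, d_v)
    return W_Q, W_K, W_V

def loss(theta, X, Y, d_k):
    # Square loss on the output of one self-attention layer (assumed form).
    W_Q, W_K, W_V = unpack(theta, X.shape[1], d_k)
    A = softmax_rows(X @ W_Q @ W_K.T @ X.T / np.sqrt(d_k))
    F = A @ X @ W_V
    return 0.5 * np.mean((F - Y) ** 2)

def fd_hessian(f, theta, eps=1e-4):
    # Central finite-difference approximation of the Hessian of f at theta.
    n = theta.size
    H = np.zeros((n, n))
    E = np.eye(n) * eps
    for i in range(n):
        for j in range(i, n):
            H[i, j] = (f(theta + E[i] + E[j]) - f(theta + E[i] - E[j])
                       - f(theta - E[i] + E[j]) + f(theta - E[i] - E[j])) / (4 * eps ** 2)
            H[j, i] = H[i, j]  # fill the lower triangle by symmetry
    return H

rng = np.random.default_rng(0)
L, d_v, d_k = 4, 3, 2                      # toy sizes, chosen arbitrarily
X = rng.normal(size=(L, d_v))
Y = rng.normal(size=(L, d_v))
theta = 0.5 * rng.normal(size=2 * d_v * d_k + d_v * d_v)

H = fd_hessian(lambda t: loss(t, X, Y, d_k), theta)
n_qk = d_v * d_k
H_QQ = H[:n_qk, :n_qk]                     # query-query diagonal block
H_VV = H[2 * n_qk:, 2 * n_qk:]             # value-value diagonal block
print("||H_QQ||_F =", np.linalg.norm(H_QQ))
print("||H_VV||_F =", np.linalg.norm(H_VV))
```

Because the layer output is linear in $\mathbf{W}_{\rm V}$, the value-value block of this toy Hessian is positive semi-definite, whereas the query and key blocks pass through the softmax and carry no such guarantee. Finite differences only scale to a handful of parameters; automatic differentiation would be the natural replacement for anything larger.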

Paper Structure

This paper contains 39 sections, 12 theorems, 102 equations, 7 figures, 1 table.

Key Result

Theorem 3.1

Outer-product Hessian $\mathbf{H}_{\mathrm{o}}$. For a single self-attention layer (eq:self-attention) with classical self-attention that feeds into the square loss, the blocks of $\mathbf{H}_{\mathrm{o}}$ are … with the first attention moment matrix $\mathbf{M}_1 := \mathbf{A}\mathbf{X} \in \mathbb{R}^{L \times d_V}$ (see ssec:attention-moment-matrices) and where $\mathbf{Z}_1 :=$ …

Figures (7)

  • Figure 1: Disparity in the Hessian blocks of a Transformer seen quantitatively and qualitatively. We used a single-block GPT-2 Transformer at initialization applied to the next token prediction task (for details, see the experimental setup section). We observe block heterogeneity in the magnitudes of Hessian entries: those of the query block are significantly smaller than those of the value block.
  • Figure 2: Construction of the second derivative matrix of a matrix-valued network $\mathbf{F}$. Taking the second derivative of $\mathbf{F}$ using row-wise vectorization and numerator layout is equivalent to computing the second derivatives of each entry separately and stacking them into a column block matrix.
  • Figure 3: (Plotted in log-log scale.) Empirical verification with a CE loss confirms derived growth rates w.r.t. magnitude $\sigma$ of $\mathbf{X}$ from Table 1. We show the growth rates through the Frobenius norm $\|\cdot\|_{\text{F}}$ of value and query diagonal blocks. The dashed lines correspond to the trend (a) predicted by theory as in Table 1, (b) estimated from the Frobenius norm measurements on the log-log scale by the linear regression slope. For details on the experimental setting, see the experimental setup section. $\sigma < 1$ (LHS of the vertical line) corresponds to practical values of $\sigma$.
  • Figure 4: (Plotted in linear scale.) Empirical verification with a CE loss confirms derived growth rates w.r.t. magnitude $\sigma$ of $\mathbf{X}$ from Table 1. We demonstrate the growth rates through the Frobenius norm $\|\cdot\|_{\text{F}}$ of value and query diagonal blocks for (a) practical range $\sigma \in (0, 1)$ and (b) bigger $\sigma$ values $\sigma \in (0, 10)$. The dashed lines correspond to the trend predicted by theory as in Table 1. For details on the experimental setting, see the experimental setup section. This figure presents the same data as in Figure 3 but using a linear scale on both axes instead of a log-log scale.
  • Figure 5: (Plotted in log-log scale.) Value and query Hessian diagonal blocks at different layers follow the predicted theoretical growth rates for practical ranges of the input. Frobenius norm $\|\cdot\|_{\text{F}}$ of the self-attention Hessian blocks for multi-layer GPT-2 Transformers without layer normalization on the next token prediction task, split by Transformer block ($1$ corresponds to the input Transformer block). We indicate the growth rates predicted by Theorems 3.1 and 3.2 with the gray dashed lines and the annotation in the bottom right corners. As for the single layer, the complete Hessian $\mathbf{H}$ value and query blocks follow the trend of the outer-product and functional Hessian blocks respectively.
  • ...and 2 more figures
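
The slope-fitting procedure mentioned in the Figure 3 caption (linear regression on the log-log scale) can be sketched as follows. This is an illustrative toy, not the paper's experiment: it uses a square loss instead of CE, random weights instead of GPT-2, and relies on the standard least-squares fact that, for a layer output $\mathbf{M}_1\mathbf{W}_{\rm V}$ under a square loss, the value-value Hessian block is (up to normalization) $\mathbf{M}_1^\top\mathbf{M}_1 \otimes \mathbf{I}$. The function name `value_block_norm`, the sizes, and the range of $\sigma$ are assumptions.

```python
import numpy as np

def softmax_rows(S):
    # Row-wise softmax with the usual max-shift for numerical stability.
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def value_block_norm(X, W_Q, W_K):
    # Frobenius norm of M1^T M1 (Kronecker) I, with M1 = A X.
    # Uses ||G (x) I_d||_F = ||G||_F * sqrt(d); loss normalization is omitted.
    d_v, d_k = W_Q.shape
    A = softmax_rows(X @ W_Q @ W_K.T @ X.T / np.sqrt(d_k))
    M1 = A @ X
    return np.linalg.norm(M1.T @ M1) * np.sqrt(d_v)

rng = np.random.default_rng(0)
L, d_v, d_k = 8, 4, 3                       # toy sizes, chosen arbitrarily
X0 = rng.normal(size=(L, d_v))              # unit-scale data, rescaled by sigma below
W_Q = 0.5 * rng.normal(size=(d_v, d_k))
W_K = 0.5 * rng.normal(size=(d_v, d_k))

sigmas = np.logspace(-2, -1, 10)            # small input magnitudes
norms = np.array([value_block_norm(s * X0, W_Q, W_K) for s in sigmas])

# Growth exponent = slope of the least-squares line in log-log coordinates.
slope, intercept = np.polyfit(np.log(sigmas), np.log(norms), 1)
print(f"estimated growth exponent of the value block: {slope:.2f}")
```

For small $\sigma$ the attention matrix stays close to uniform, so $\mathbf{M}_1$ scales roughly linearly with $\sigma$ and the fitted exponent should land near 2 in this toy setting. The same fit applied to measured block norms is what allows a comparison against theoretically predicted rates.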

Theorems & Definitions (31)

  • Theorem 3.1
  • Theorem 3.2
  • Definition 3.1
  • Remark 3.1
  • Remark 4.1
  • Lemma 4.1
  • Definition A.1
  • Definition A.2
  • proof
  • proof
  • ...and 21 more