Why "classic" Transformers are shallow and how to make them go deep

Yueyao Yu; Yin Zhang

TL;DR

This work identifies token similarity escalation (TSE) as the fundamental reason deep classic Transformers underperform, formalizing $t_{sim}(X)$ and $t_{div}(X)$ and showing that self-attention plus residual drives representations toward the span of $\mathds{1}$. The authors provide a theoretical analysis linking TSE to the invariant leading eigenspace and the spectral gap of the attention matrix, deriving a lower bound on the escalation rate $\mathbb{E}[r(X,Y)]$ that scales with $t_{sim}(X)$ and depends on $|\lambda_2(P)|$. They propose a simple de-escalation operation $Y=(I-\tau\Pi_\mathds{1})X$ (often $\tau=1$) to surgically reduce excessive similarity, and demonstrate, on ViT-CIFAR10 and Transformer-XL-WikiText-103, that this strategy markedly improves deep post-norm models and can approach pre-norm performance. The findings offer a practical pathway to deeper Transformer architectures without fully discarding the self-attention mechanism, with implications for scaling large language and vision models.
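The de-escalation operation $Y=(I-\tau\Pi_\mathds{1})X$ can be sketched in a few lines of NumPy. This is an illustration, not the authors' code. It assumes tokens are the rows of $X \in \mathbb{R}^{n\times d}$ and that $\Pi_\mathds{1} = \frac{1}{n}\mathds{1}\mathds{1}^\top$ is the orthogonal projector onto the span of the all-ones vector. The `token_similarity` helper uses a squared Frobenius-norm ratio as a stand-in, because the summary names $t_{sim}(X)$ without giving its formula; the function names are hypothetical.

```python
import numpy as np

def proj_ones(n):
    """Orthogonal projector onto the span of the all-ones vector in R^n."""
    return np.ones((n, n)) / n

def token_similarity(X):
    # ASSUMPTION: similarity measured as ||Pi X||_F^2 / ||X||_F^2.
    # The summary does not state the formula for t_sim; this is a stand-in.
    PX = proj_ones(X.shape[0]) @ X
    return float(np.sum(PX**2) / np.sum(X**2))

def de_escalate(X, tau=1.0):
    # Y = (I - tau * Pi_1) X. Because Pi_1 X replaces every token with the
    # mean token, this equals subtracting tau times the mean token from each row.
    return X - tau * X.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8)) + 3.0   # shared offset makes the tokens alike
Y_half = de_escalate(X, tau=0.5)     # partial removal of the shared component
Y_full = de_escalate(X, tau=1.0)     # full removal (mean-centering over tokens)

print(token_similarity(X), token_similarity(Y_half), token_similarity(Y_full))
```

With $\tau=1$ the operation reduces to centering the tokens around their mean, which is why it costs no more than a single mean and a subtraction per block.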

Abstract

Since its introduction in 2017, Transformer has emerged as the leading neural network architecture, catalyzing revolutionary advancements in many AI disciplines. The key innovation in Transformer is a Self-Attention (SA) mechanism designed to capture contextual information. However, extending the original Transformer design to models of greater depth has proven exceedingly challenging, if not impossible. Even though various modifications have been proposed in order to stack more layers of SA mechanism into deeper models, a full understanding of this depth problem remains lacking. In this paper, we conduct a comprehensive investigation, both theoretically and empirically, to substantiate the claim that the depth problem is caused by \emph{token similarity escalation}; that is, tokens grow increasingly alike after repeated applications of the SA mechanism. Our analysis reveals that, driven by the invariant leading eigenspace and large spectral gaps of attention matrices, token similarity provably escalates at a linear rate. Based on the gained insight, we propose a new strategy of surgically removing excessive similarity in contrast to the existing approach of diminishing the SA mechanism explicitly or implicitly (such as in pre-norm transformers). Preliminary experimental results confirm the effectiveness of the proposed strategy in small-scale post-norm Transformer models.
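The escalation mechanism described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's setup: it assumes a single fixed row-stochastic attention matrix $P$ reused at every layer, a residual update rescaled by one half to keep norms bounded, and the same assumed squared-norm similarity ratio as a stand-in for $t_{sim}$. Since every row of $P$ sums to one, $\mathds{1}$ is an eigenvector with eigenvalue 1, and the remaining eigenvalues are strictly smaller in magnitude, so repeated application pulls all tokens toward a common vector.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax; every row of the result is positive and sums to 1."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def token_similarity(X):
    # ASSUMPTION: ||Pi X||_F^2 / ||X||_F^2 with Pi the projector onto span(1).
    # ||Pi X||_F^2 equals n times the squared norm of the mean token.
    mean_row = X.mean(axis=0, keepdims=True)
    return float(X.shape[0] * np.sum(mean_row**2) / np.sum(X**2))

rng = np.random.default_rng(0)
n, d, depth = 16, 8, 40

# Toy assumption: one fixed attention matrix shared by all layers.
P = softmax_rows(rng.normal(size=(n, n)))
X = rng.normal(size=(n, d))

sims = [token_similarity(X)]
for _ in range(depth):
    # Residual plus attention, halved so the norm stays bounded;
    # the rescaling does not change the similarity ratio.
    X = 0.5 * (X + P @ X)
    sims.append(token_similarity(X))

# Second-largest eigenvalue magnitude of P: the smaller it is (the larger
# the spectral gap), the faster the non-constant component is damped.
eig_mags = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
lambda2 = float(eig_mags[1])

print(f"|lambda_2(P)| = {lambda2:.3f}")
print(f"similarity at depth 0: {sims[0]:.3f}, at depth {depth}: {sims[-1]:.6f}")
```

In this toy setting the similarity approaches 1 as depth grows, which is the behaviour the de-escalation strategy is meant to counteract.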

Paper Structure (25 sections, 11 theorems, 55 equations, 8 figures, 1 table)

Introduction
Token Similarity
Related Works
Contributions
Notation
Analysis of TSE in Transformer
Transformer Architecture
How Self-Attention Drives TSE
An intuitive interpretation
A theoretical analysis
Other Steps Do Not Impact TSE
Experimental Verification
Discussion
Mitigation of TSE in Transformers
Implicit Mitigation in Pre-norm
...and 10 more sections

Key Result

Proposition 2.3

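Given $X, Y \in \mathbb{R}^{n\times d}$ with $\mathbf{t}_{sim}(X), \mathbf{t}_{sim}(Y) \in (0,1)$, let $r(X,Y)$ be the escalation rate (source label: def:r(X,Y)). Then an identity holds that expresses $r(X,Y)$ in terms of two quantities $\xi_1$ and $\xi_2$ [displayed equation and the definitions of $\xi_1$, $\xi_2$ missing in the source]. Therefore, $r(X,Y) > 1$ if and only if $\xi_1>\xi_2$.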

Figures (8)

Figure 1: Values of token similarity (left), cosine similarity (middle), and gradient norm (right) at each block in two Transformer models at default initialization.
Figure 2: Average values of token similarity, $\xi_1/\xi_2$, $\delta$ and $\omega$ over 50 trials. At each multi-head block, $\delta$ and $\omega$ are also averaged across the eight heads.
Figure 3: Average values over 1000 trials of token similarity, $\xi_1/\xi_2-1$ and $r(X,Y)$, together with estimates of the latter two quantities from \ref{E5:rate} and \ref{if2:P=P'}.
Figure 4: Left: Token similarity in a pre-norm Transformer model. Right: Frobenius norms of the input $X$ (blue), $\hat{X}$ (orange) and $P(\hat{X})\hat{X}W$ (green); see \ref{def:pre-norm} for definitions. Attention matrices are computed by the softmax formula.
Figure 5: Token diversity in de-escalated Transformers with softmax attention, averaged over 20 runs, where each block is the same as Algorithm 1. Three de-escalation values are used: $\tau$ = 0.1, 0.5 and 1.
...and 3 more figures

Theorems & Definitions (21)

Definition 1.1
Definition 2.2
Proposition 2.3
proof
Lemma 2.4
Proposition 2.5
Theorem 2.6
Corollary 2.7
Remark 2.8
Proposition 2.9
...and 11 more
