Setting the Record Straight on Transformer Oversmoothing

Gbètondji J-S Dovonon; Michael M. Bronstein; Matt J. Kusner

TL;DR

This work challenges the notion that Transformer oversmoothing is inevitable, combining theoretical eigenvalue analysis with empirical evidence across vision and language models. It shows that smoothing behavior hinges on the joint spectrum of the attention-derived matrix $\mathbf{A}$ and the weight composition $\mathbf{H}$: when a single eigenvalue dominates, the model exhibits input, angle, and rank convergence, while other dominating eigenvalues yield partial or no smoothing. The authors propose a practical reparameterization of the update, $\mathbf{H} = \mathbf{V}_H \Lambda_H \mathbf{V}_H^{-1}$ with clipped diagonal entries in $\Lambda_H$, to steer a model toward either smoothing or sharpening, and they observe that the sign of the layer normalization weights can modulate these effects. The findings offer guidance for designing future Transformer architectures and training regimes, with implications for controlling information propagation in deep models.
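The reparameterization mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `reparameterize_h` and the bounds `low` and `high` are hypothetical, and the sketch does not assume which clipping interval produces smoothing or sharpening, since that is determined by the paper's theorems. It only shows that clipping the diagonal of $\Lambda_H$ fixes the eigenvalues of the resulting $\mathbf{H}$.

```python
import numpy as np


def reparameterize_h(v, raw_eigs, low, high):
    """Build H = V diag(clip(raw_eigs, low, high)) V^{-1}.

    v        : (d, d) invertible matrix whose columns act as eigenvectors of H
    raw_eigs : (d,) unconstrained values (e.g. free parameters during training)
    low/high : hypothetical clipping bounds for the eigenvalues of H
    """
    eigs = np.clip(raw_eigs, low, high)
    # v * eigs scales column j of v by eigs[j], i.e. it computes V @ diag(eigs).
    return (v * eigs) @ np.linalg.inv(v)


rng = np.random.default_rng(0)
v = rng.standard_normal((4, 4))          # random matrix, invertible with probability 1
raw_eigs = np.array([-2.0, -0.5, 0.3, 1.7])

h = reparameterize_h(v, raw_eigs, low=-1.0, high=0.5)

# The spectrum of H equals the clipped values, whatever V is.
spectrum = np.sort(np.linalg.eigvals(h).real)
print(spectrum)
```

Because the eigenvalues are set directly, any constraint on the spectrum of $\mathbf{H}$ reduces to a choice of `low` and `high`, while the eigenvector matrix stays unconstrained.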

Abstract

Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown empirically and theoretically that Transformers are inherently limited. Specifically, these works argue that as model depth increases, Transformers oversmooth, i.e., inputs become more and more similar. A natural question is: How can Transformers achieve these successes given this shortcoming? In this work we test these observations empirically and theoretically and uncover a number of surprising findings. We find that there are cases where feature similarity increases but, contrary to prior results, this is not inevitable, even for existing pre-trained models. Theoretically, we show that smoothing behavior depends on the eigenspectrum of the value and projection weights. We verify this empirically and observe that the sign of layer normalization weights can influence this effect. Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior. We hope that our findings give ML researchers and practitioners additional insight into how to develop future Transformer-based models.

Paper Structure (28 sections, 16 theorems, 18 equations, 5 figures, 3 tables)


Introduction
Background & Related Work
The Transformer Update
What Is Oversmoothing?
Input Convergence.
Angle Convergence.
Rank Collapse.
Observations of Transformer Oversmoothing
The Theory of Transformer Oversmoothing
Input Convergence.
Angle Convergence.
Rank Convergence.
Do Transformers Always Oversmooth?
Preliminaries
The Eigenvalues
...and 13 more sections

Key Result

Proposition 1

Given the paper's assumption on ${\mathbf{A}}$, all eigenvalues of ${\mathbf{A}}$ lie within $(-1,1]$. There is one largest eigenvalue, equal to $1$, with corresponding unique eigenvector $\boldsymbol{1}$.
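A small numerical check can make this statement concrete. The sketch below assumes that ${\mathbf{A}}$ is a softmax attention matrix, i.e. row-stochastic with strictly positive entries; the paper's exact assumption is not reproduced here. For a general non-symmetric matrix of this kind the eigenvalues may be complex, so the sketch checks only the weaker statement about moduli: exactly one eigenvalue has modulus $1$, and $\boldsymbol{1}$ is an eigenvector for eigenvalue $1$. The helper `softmax_rows` and the random scores are illustrative choices.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n = 6
scores = rng.standard_normal((n, n))   # stand-in for scaled query-key products
attn = softmax_rows(scores)            # positive entries, each row sums to 1

ones = np.ones(n)
moduli = np.abs(np.linalg.eigvals(attn))

# Rows sum to one, so A @ 1 = 1: the all-ones vector has eigenvalue 1.
print(np.allclose(attn @ ones, ones))
# No eigenvalue exceeds 1 in modulus, and only one attains it.
print(moduli.max(), int(np.sum(moduli > 1 - 1e-8)))
```

The uniqueness of the eigenvalue at $1$ for positive row-stochastic matrices follows from the Perron-Frobenius theorem, which is why repeated application of such a matrix pulls inputs toward the direction of $\boldsymbol{1}$.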

Figures (5)

Figure 1: Theory of Transformer Oversmoothing. A ✓ indicates prior work says that the corresponding Definition is always satisfied, an ✗ indicates it is not always satisfied. Note that if a work argues a Definition is satisfied, then all later Definitions, which are progressively more relaxed, must also be satisfied.
Figure 2: Smoothing behavior. The smoothing metrics defined in Definitions 1-3 for different models and datasets in vision and NLP. See text for details.
Figure 3: Influencing smoothing. The smoothing metrics defined in Definitions 1-3 for different models and datasets when ${\mathbf{H}}$ is reparameterized as ${\mathbf{H}} = {\mathbf{V}}_H \Lambda_H {\mathbf{V}}^{-1}_H$. See text for details.
Figure 4: Impact of Layer Normalization. The average $\mathrm{HFC}/\mathrm{LFC}$ for the Transformer update with repeated layers (the paper's vectorized update equation) and different types of layer normalization (Post-LN [vaswani2017attention], Pre-LN [baevski2018adaptive]), where the weights of the layer normalization are fixed to be positive or negative. See text for details.
Figure 5: Distributions of eigenvalues of ${\mathbf{H}}$. (Top) Vision models have distributions skewing to the negatives; (Bottom) Language models have symmetrically distributed eigenvalues.

Theorems & Definitions (27)

Definition 1: Input Convergence [wang2022anti]
Definition 2: Angle Convergence
Definition 3: Rank Collapse
Proposition 1 [meyer2023matrix]
Lemma 1
Definition 4: Dominating eigenvalue(s)
Theorem 1
Theorem 2
Corollary 1
Theorem 3
...and 17 more
