Setting the Record Straight on Transformer Oversmoothing

Gbètondji J-S Dovonon, Michael M. Bronstein, Matt J. Kusner

TL;DR

This work challenges the notion that Transformer oversmoothing is inevitable, combining theoretical eigenvalue analysis with empirical evidence from vision and language models. It shows that smoothing behavior hinges on the joint spectrum of the attention-derived matrix $\mathbf{A}$ and the weight composition $\mathbf{H}$: a single dominant eigenvalue drives input, angle, and rank convergence, while other dominating eigenvalues yield partial or no smoothing. The authors propose a practical reparameterization of the update, $\mathbf{H} = \mathbf{V}_H \Lambda_H \mathbf{V}_H^{-1}$ with clipped diagonal entries of $\Lambda_H$, to steer the model toward either sharpening or smoothing, and they highlight that the sign of the layer normalization weights can modulate these effects. The findings offer guidance for designing future Transformer architectures and training regimes, with implications for controlling information propagation in deep models.

Abstract

Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown empirically and theoretically that Transformers are inherently limited. Specifically, they argue that as model depth increases, Transformers oversmooth, i.e., inputs become more and more similar. A natural question is: How can Transformers achieve these successes given this shortcoming? In this work we test these observations empirically and theoretically and uncover a number of surprising findings. We find that there are cases where feature similarity increases but, contrary to prior results, this is not inevitable, even for existing pre-trained models. Theoretically, we show that smoothing behavior depends on the eigenspectrum of the value and projection weights. We verify this empirically and observe that the sign of layer normalization weights can influence this effect. Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior. We hope that our findings give ML researchers and practitioners additional insight into how to develop future Transformer-based models.
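
The reparameterization mentioned in the TL;DR, $\mathbf{H} = \mathbf{V}_H \Lambda_H \mathbf{V}_H^{-1}$ with clipped diagonal entries, can be illustrated with a minimal NumPy sketch. The function name `reparameterized_H`, the clipping bounds `lo` and `hi`, and the random test matrices are assumptions made for illustration; which clipping range leads to smoothing or sharpening follows from the paper's theorems and is not restated here.

```python
import numpy as np

def reparameterized_H(V, raw_diag, lo, hi):
    """Build H = V diag(clip(raw_diag, lo, hi)) V^{-1}.

    V        : (d, d) invertible matrix (plays the role of V_H)
    raw_diag : (d,) unconstrained values for the diagonal of Lambda_H
    lo, hi   : clipping bounds (hypothetical values, chosen by the user)
    """
    lam = np.clip(raw_diag, lo, hi)          # clipped diagonal entries of Lambda_H
    H = V @ np.diag(lam) @ np.linalg.inv(V)  # similarity transform: eigenvalues of H are lam
    return H, lam

rng = np.random.default_rng(0)
d = 4
# Hypothetical V_H: a small perturbation of the identity, so it is safely invertible.
V = np.eye(d) + 0.1 * rng.normal(size=(d, d))
raw = rng.normal(scale=2.0, size=d)

# Illustrative bounds only; they are not taken from the paper.
H, lam = reparameterized_H(V, raw, lo=-1.0, hi=0.0)
eigs = np.linalg.eigvals(H)
print("clipped diagonal:", np.sort(lam))
print("eigenvalues of H:", np.sort(eigs.real))
```

Because $\mathbf{H}$ is similar to $\Lambda_H$, clipping the diagonal directly constrains the eigenvalues of $\mathbf{H}$, which is what makes the spectrum controllable in this parameterization.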

Paper Structure

This paper contains 28 sections, 16 theorems, 18 equations, 5 figures, 3 tables.

Key Result

Proposition 1

Given the assumption on ${\mathbf{A}}$ (label assume:A), all eigenvalues of ${\mathbf{A}}$ lie within $(-1,1]$. The largest eigenvalue is unique and equal to $1$, with corresponding eigenvector $\boldsymbol{1}$.
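
Proposition 1 can be sanity-checked numerically. The sketch below assumes that ${\mathbf{A}}$ is a softmax attention matrix, i.e. row-stochastic with strictly positive entries; the exact content of the paper's assumption is not shown in this section. For a generic softmax matrix the eigenvalues may be complex, so the check uses their moduli rather than testing membership in the real interval $(-1,1]$.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax: each row is positive and sums to 1."""
    S = S - S.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 6
A = softmax_rows(rng.normal(size=(n, n)))  # hypothetical attention matrix

ones = np.ones(n)
row_stochastic = np.allclose(A @ ones, ones)  # A 1 = 1, so 1 is an eigenvector for eigenvalue 1

eigvals = np.linalg.eigvals(A)
mods = np.sort(np.abs(eigvals))[::-1]         # eigenvalue moduli, largest first

print("A @ 1 == 1:", row_stochastic)
print("largest modulus:", mods[0])
print("second largest modulus:", mods[1])
```

The strict gap between the largest and second-largest modulus is what the Perron-Frobenius theorem guarantees for a matrix with strictly positive entries.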

Figures (5)

  • Figure 1: Theory of Transformer Oversmoothing. A ✓ indicates that prior work says the corresponding Definition is always satisfied; an ✗ indicates that it is not always satisfied. Note that if a work argues a Definition is satisfied, then all later Definitions, which are progressively more relaxed, must also be satisfied.
  • Figure 2: Smoothing behavior. The smoothing metrics defined in Definitions 1-3 (input convergence, angle convergence, rank collapse) for different models and datasets in vision and NLP. See text for details.
  • Figure 3: Influencing smoothing. The smoothing metrics defined in Definitions 1-3 for different models and datasets when ${\mathbf{H}}$ is reparameterized as ${\mathbf{H}} = {\mathbf{V}}_H \Lambda_H {\mathbf{V}}^{-1}_H$. See text for details.
  • Figure 4: Impact of Layer Normalization. The average $\mathrm{HFC}/\mathrm{LFC}$ for the Transformer update with repeated layers (the vectorized update equation, eq:vec_update) and different types of layer normalization (Post-LN, Vaswani et al., 2017; Pre-LN, Baevski & Auli, 2018), where the weights of the layer normalization are fixed to be positive or negative. See text for details.
  • Figure 5: Distributions of eigenvalues of ${\mathbf{H}}$. (Top) Vision models have distributions skewed toward negative values; (Bottom) language models have symmetrically distributed eigenvalues.
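
The $\mathrm{HFC}/\mathrm{LFC}$ quantity reported in Figure 4 is not defined in this section. The sketch below assumes the common Fourier-style decomposition in which the low-frequency component is the token mean, $\mathrm{LFC}[\mathbf{X}] = \tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top \mathbf{X}$, and the high-frequency component is the remainder, $\mathrm{HFC}[\mathbf{X}] = \mathbf{X} - \mathrm{LFC}[\mathbf{X}]$. It then applies a fixed softmax matrix repeatedly. This is a deliberately simplified iteration, not the paper's update equation, and all names and sizes are hypothetical.

```python
import numpy as np

def hfc_lfc_ratio(X):
    """Frobenius-norm ratio ||HFC[X]|| / ||LFC[X]|| under the assumed definition."""
    n = X.shape[0]
    lfc = np.ones((n, 1)) @ X.mean(axis=0, keepdims=True)  # (1/n) 1 1^T X
    hfc = X - lfc                                           # everything except the token mean
    return np.linalg.norm(hfc) / np.linalg.norm(lfc)

def softmax_rows(S):
    """Row-wise softmax: each row is positive and sums to 1."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 3
A = softmax_rows(rng.normal(size=(n, n)))  # hypothetical fixed attention matrix
X = rng.normal(size=(n, d))                # hypothetical token features

ratios = []
for _ in range(51):
    ratios.append(hfc_lfc_ratio(X))
    X = A @ X                              # pure attention step: no weights, no residual

print("ratio at depth 0: ", ratios[0])
print("ratio at depth 50:", ratios[-1])
```

In this stripped-down setting the ratio decays toward zero, which is the oversmoothing behavior that prior work describes; the paper's point is that the full update, including ${\mathbf{H}}$, need not behave this way.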

Theorems & Definitions (27)

  • Definition 1: Input Convergence (Wang et al., 2022)
  • Definition 2: Angle Convergence
  • Definition 3: Rank Collapse
  • Proposition 1 (Meyer, 2023)
  • Lemma 1
  • Definition 4: Dominating eigenvalue(s)
  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Theorem 3
  • ...and 17 more