Self-attention Networks Localize When QK-eigenspectrum Concentrates

Han Bao, Ryuichiro Hataya, Ryo Karakida

TL;DR

<p>The paper investigates when self-attention localizes on a sequence and shows that localization is governed by the eigenspectrum of the query-key parameter matrix $W_{QK}$, specifically its mean and variance. Using a signal-propagation framework and a piecewise-linear approximation of softmax, it shows that a small spectral variance with a nonzero mean promotes localization, which mitigates both rank collapse and entropy collapse and so improves expressivity and training dynamics. This gives a unified view of two collapse phenomena that were previously treated separately: controlling the $W_{QK}$ eigenspectrum can reconcile them and improve performance. The LocAteR regularization puts this idea into practice by shrinking the spectrum’s scale while preserving its mean; experiments on WikiText-2 show improved perplexity and higher attention entropy under localization.</p>

Abstract

The self-attention mechanism prevails in modern machine learning. It has an interesting functionality of adaptively selecting tokens from an input sequence by modulating the degree of attention localization, which many researchers speculate is the basis of the powerful model performance but complicates the underlying mechanism of the learning dynamics. In recent years, mainly two arguments have connected attention localization to the model performances. One is the rank collapse, where the embedded tokens by a self-attention block become very similar across different tokens, leading to a less expressive network. The other is the entropy collapse, where the attention probability approaches non-uniform and entails low entropy, making the learning dynamics more likely to be trapped in plateaus. These two failure modes may apparently contradict each other because the rank and entropy collapses are relevant to uniform and non-uniform attention, respectively. To this end, we characterize the notion of attention localization by the eigenspectrum of query-key parameter matrices and reveal that a small eigenspectrum variance leads attention to be localized. Interestingly, the small eigenspectrum variance prevents both rank and entropy collapse, leading to better model expressivity and trainability.
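
The two quantities the abstract relates, the eigenspectrum of the query-key matrix and the entropy of the attention probabilities, can be computed directly. The following is a minimal NumPy sketch, not the authors' code. It assumes a symmetric $\mathbf{W}_{\mathrm{QK}}$, attention logits of the form $\mathbf{X}\mathbf{W}_{\mathrm{QK}}\mathbf{X}^\top$ with no extra scaling, and Gaussian token embeddings; the helper names (`attention_entropy`, `eigenspectrum_stats`, `sample_symmetric`) are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(X, W_qk):
    """Mean Shannon entropy (in nats) of the rows of softmax(X W_qk X^T)."""
    P = softmax(X @ W_qk @ X.T, axis=-1)
    row_entropy = -(P * np.log(np.clip(P, 1e-12, None))).sum(axis=-1)
    return row_entropy.mean()

def eigenspectrum_stats(W_qk):
    """Mean and variance of the eigenvalues of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(W_qk)
    return eigvals.mean(), eigvals.var()

def sample_symmetric(rng, d, mean, scale):
    """Symmetric matrix with eigenvalues drawn from N(mean, scale^2)
    in a random orthogonal basis (an illustrative choice)."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    lam = mean + scale * rng.standard_normal(d)
    W = (Q * lam) @ Q.T          # Q diag(lam) Q^T
    return (W + W.T) / 2         # remove floating-point asymmetry

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 40, 128               # sequence length and width (assumed values)
    X = rng.standard_normal((T, d)) / np.sqrt(d)
    for scale in (0.01, 0.1, 1.0):
        W = sample_symmetric(rng, d, mean=1.0, scale=scale)
        m, v = eigenspectrum_stats(W)
        print(f"scale={scale}: eig mean={m:.3f}, eig var={v:.4f}, "
              f"entropy={attention_entropy(X, W):.3f} (max {np.log(T):.3f})")
```

Uniform attention over $T$ tokens attains the maximum entropy $\log T$, so the printed entropy can be read against that ceiling. The sketch only measures the two quantities; it does not reproduce the paper's analysis or its experiments.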

Paper Structure

This paper contains 29 sections, 6 theorems, 51 equations, 12 figures.

Key Result

Lemma 1

Suppose that $\mathbf{W}_{\mathrm{QK}}$ is symmetric and independent from $\mathbf{X}$, and let $\mathbf{W} \coloneqq \mathbf{W}_{\mathrm{QK}}\bm{\Sigma}$. Under the random-walk assumption, for $i \in [T]$, the mean $\mu^i$ and variance $v^i$ of $\left\langle{\bm{\gamma}^i},{\bm{\omega}}\right\rangle +

Figures (12)

  • Figure 1: Comparison of softmax $\mathbf{S}$ and the piecewise approximation $\widetilde{\mathbf{S}}$ for two-dimensional inputs.
  • Figure 2: Theoretical plots of the signal propagation probability $\rho(\theta)$ for different $\xi = \mathrm{tr}(\mathbf{W})/\sqrt{\mathrm{tr}(\mathbf{W}^2)}$ and $\eta = \sqrt{\mathrm{tr}(\mathbf{W}^2)}/\lambda^2$. The vertical axes indicate relative token position $\theta = i/T$ ($i$: token index, $T$: number of tokens). Values of $\theta$ close to zero correspond to early-site tokens, and values close to one to late-site tokens.
  • Figure 3: Theoretical plots of $\rho(\theta)$. For $\xi = 128$ and $\xi = 512$, the product $\xi\eta$ is $1.28$ and $5.12$, respectively. The latter is sufficiently larger than the localization threshold $r = 2$, so attention is localized.
  • Figure 4: The entropy lower bound by Zhai et al. (2023).
  • Figure 5: Simulated signal propagation probability. The top and bottom rows show the results for the isotropic and anisotropic covariances, respectively (details in the text). (Left) Signal propagation probability $\rho_i$ computed over $300$ repeatedly sampled random walks (following the random-walk assumption) with $40$ tokens. For each line, $\mathbf{W}_{\mathrm{QK}}$ ($d=128$) is sampled $10$ times with the corresponding mean and scale of the eigenvalue distribution, and the averaged $\rho_i$ is denoted by the bold line. (Right) The attention entropy (Zhai et al., 2023) is computed for $\mathbf{W}_{\mathrm{QK}}$ with different eigenvalue mean-scale pairs.
  • ...and 7 more figures
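
The two summary statistics in the Figure 2 caption, $\xi = \mathrm{tr}(\mathbf{W})/\sqrt{\mathrm{tr}(\mathbf{W}^2)}$ and $\eta = \sqrt{\mathrm{tr}(\mathbf{W}^2)}/\lambda^2$, are simple trace computations. The sketch below is a hedged illustration in NumPy. The function names are hypothetical, and `lam` stands in for the $\lambda$ of the caption, which this page does not define. The second helper uses the identities $\mathrm{tr}(\mathbf{W}) = d\,m$ and $\mathrm{tr}(\mathbf{W}^2) = d\,(s^2 + m^2)$ for a $d \times d$ matrix with real eigenvalues of mean $m$ and variance $s^2$, which make explicit how $\xi$ depends on the spectrum's mean and variance.

```python
import numpy as np

def xi_eta(W, lam):
    """Return (xi, eta) as defined in the Figure 2 caption.

    `lam` is a placeholder for the caption's lambda, which is not
    defined on this page; it must be supplied by the caller.
    """
    tr_w = np.trace(W)
    tr_w2 = np.trace(W @ W)
    if tr_w2 <= 0:
        raise ValueError("tr(W^2) must be positive to take its square root")
    root = np.sqrt(tr_w2)
    return tr_w / root, root / lam**2

def xi_from_moments(mean, var, d):
    """xi written via the eigenvalue mean and variance:
    xi = sqrt(d) * mean / sqrt(var + mean^2)."""
    return np.sqrt(d) * mean / np.sqrt(var + mean**2)

if __name__ == "__main__":
    d = 128
    rng = np.random.default_rng(0)
    for scale in (0.01, 0.1, 1.0):
        eig = 1.0 + scale * rng.standard_normal(d)
        W = np.diag(eig)         # diagonal W, so the eigenvalues are explicit
        xi, eta = xi_eta(W, lam=1.0)
        print(f"scale={scale}: xi={xi:.3f}, eta={eta:.3f}, "
              f"xi via moments={xi_from_moments(eig.mean(), eig.var(), d):.3f}")
```

For a fixed nonzero mean, $\xi$ approaches its maximum magnitude $\sqrt{d}$ as the eigenvalue variance shrinks, which is the regime the TL;DR associates with localization. The values quoted for Figure 3 ($\xi\eta = 1.28$ at $\xi = 128$ and $5.12$ at $\xi = 512$) both correspond to $\eta = 0.01$.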

Theorems & Definitions (11)

  • Remark 1
  • Remark 2
  • Definition 1: Signal propagation probability
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 4
  • proof
  • Lemma 4
  • ...and 1 more