How Smooth Is Attention?

Valérie Castin, Pierre Ablin, Gabriel Peyré

TL;DR

This work analyzes the smoothness of Transformer self-attention through local Lipschitz constants in both unmasked and masked settings. It develops Euclidean and mean-field formulations, establishing tight bounds that scale as √n in moderate regimes and become independent of n in a mean-field regime, with a separate large-radius regime showing similar behavior for almost all inputs. A novel mean-field framework for masked self-attention is introduced, along with a conditional-transport distance to quantify sequentially conditioned changes. Theoretical results are corroborated by experiments on BERT and GPT-2 that reveal real-data Lipschitz growth around n^{1/4} and adversarial constructions that reach the worst-case √n rate, highlighting implications for robustness and architectural design of Transformers.

Abstract

Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties - which are key when it comes to analyzing robustness and expressive power - is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
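The quantity studied in the abstract can be illustrated numerically. At a given input, the local Lipschitz constant of a differentiable map is the spectral norm of its Jacobian. The sketch below is a minimal illustration under stated assumptions, not the authors' code. It assumes the common single-head form $f(X)_i = \sum_j \mathrm{softmax}_j(x_i^\top A x_j)\, V x_j$, since the paper's Definition 2.1 is not reproduced in this excerpt. It approximates the Jacobian by central finite differences and estimates its spectral norm by power iteration. The names `self_attention`, `jacobian`, `spectral_norm`, and `local_lipschitz` are hypothetical helpers introduced here.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, A, V, masked=False):
    """Assumed form: f(X)_i = sum_j softmax_j(x_i^T A x_j) V x_j.

    With masked=True, token i attends only to tokens j <= i.
    """
    A_t = list(zip(*A))  # transpose, so dot(A_t x_i, x_j) = x_i^T A x_j
    out = []
    for i, xi in enumerate(X):
        keys = X[: i + 1] if masked else X
        q = matvec(A_t, xi)
        weights = softmax([dot(q, xj) for xj in keys])
        mixed = [sum(w * xj[c] for w, xj in zip(weights, keys))
                 for c in range(len(xi))]
        out.append(matvec(V, mixed))
    return out


def flatten(X):
    return [c for row in X for c in row]


def unflatten(flat, d):
    return [flat[i:i + d] for i in range(0, len(flat), d)]


def jacobian(f, X, eps=1e-5):
    """Central finite-difference Jacobian (rows: outputs, columns: inputs)."""
    d = len(X[0])
    x0 = flatten(X)
    cols = []
    for k in range(len(x0)):
        plus, minus = x0[:], x0[:]
        plus[k] += eps
        minus[k] -= eps
        fp = flatten(f(unflatten(plus, d)))
        fm = flatten(f(unflatten(minus, d)))
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    return [list(row) for row in zip(*cols)]


def spectral_norm(J, iters=500):
    """Estimate the largest singular value of J by power iteration on J^T J."""
    J_t = [list(col) for col in zip(*J)]
    rng = random.Random(0)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(J_t))]
    for _ in range(iters):
        w = matvec(J_t, matvec(J, v))
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    Jv = matvec(J, v)
    return math.sqrt(dot(Jv, Jv))


def local_lipschitz(X, A, V, masked=False):
    """Spectral norm of the Jacobian of self-attention at the input X."""
    return spectral_norm(jacobian(lambda Y: self_attention(Y, A, V, masked), X))


# Small demo on random inputs (n = 4 tokens in dimension d = 2).
rng = random.Random(1)
X_demo = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]
identity = [[1.0, 0.0], [0.0, 1.0]]
print("unmasked:", local_lipschitz(X_demo, identity, identity))
print("masked:  ", local_lipschitz(X_demo, identity, identity, masked=True))
```

A quick sanity check: with $A = 0$ and $V = I$, every attention weight equals $1/n$, so unmasked self-attention reduces to the linear map sending each token to the mean of the sequence, whose Lipschitz constant is exactly 1. For real models, an automatic-differentiation Jacobian would be the natural replacement for the finite-difference step used here.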

Paper Structure (56 sections, 29 theorems, 103 equations, 6 figures)

Key Result

Lemma 3.1

Let $\mathcal{X}$ be an open and connected subset of $(\mathbb R^d)^n$. Then

Figures (6)

  • Figure 1: Scatter plots of the local Lipschitz constant of self-attention (column 1) and masked self-attention (columns 2 and 3) on text data (upper row) and adversarial data (lower row) as a function of the sequence length $n$. In the upper row, the color encodes the mean radius of inputs $X = (x_1, \dots, x_n)$, defined as $R \coloneqq \sqrt{\frac{1}{n}\sum_{i=1}^n \lvert x_i\rvert^2}$. Lighter points have a smaller mean radius. The first two columns correspond to two different pretrained BERT models: an Encoder-only and a Decoder-only, on the same dataset Alice in Wonderland, respectively for attention layers 0 and 6. The third column is obtained with the masked self-attention layer 6 of GPT-2 randomly initialized, on the dataset AG_NEWS. We see that the Lipschitz constant of self-attention on real data grows approximately like $n^{1/4}$ with the sequence length $n$ and that the growth rate is $\sqrt{n}$ for adversarial data, which shows the tightness of Theorems \ref{thm:unnorm_general_bound}, \ref{thm:unnorm_large_R_regime}, \ref{thm:masked_general_bound} and \ref{thm:large_R_masked}.
  • Figure 2: Plot of the scaling factor $2R^2 \gamma$ across the layers of pretrained BERT, for three different heads and five text extracts from Alice in Wonderland (50 tokens per extract).
  • Figure 3: Linear growth of the square root of the Lipschitz constant of self-attention in the configuration $X_R(n)$.
  • Figure 4: Scatter plots of the local Lipschitz constant of masked self-attention for pretrained GPT-2 as a function of the sequence length, on the dataset Alice in Wonderland. The first column corresponds to masked self-attention layer 0, and the second column to layer 6.
  • Figure 5: Norm of the positional embeddings of pretrained GPT-2, ordered by position. The very first tokens are associated with positional embeddings of much larger magnitude, which makes $n$ and $R$ dependent from the very beginning of the architecture.
  • ...and 1 more figure
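
The two quantities used to read Figure 1, the mean radius $R$ and a growth exponent in $n$, can be computed as in the sketch below. This is an illustrative sketch only: the helper names `mean_radius` and `loglog_slope` are hypothetical, and the values passed to the slope fit are synthetic power laws, not measurements from the paper. The exponent is taken as the least-squares slope of $\log(\text{value})$ against $\log n$, which is one simple way to estimate a growth rate, not necessarily the one the authors used.

```python
import math


def mean_radius(X):
    """R = sqrt((1/n) * sum_i |x_i|^2) for a sequence X of n token vectors."""
    n = len(X)
    return math.sqrt(sum(sum(c * c for c in x) for x in X) / n)


def loglog_slope(ns, values):
    """Least-squares slope of log(values) against log(ns)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic curves (not data from the paper): exact power laws in n.
lengths = [8, 16, 32, 64, 128, 256]
sqrt_growth = [3.0 * math.sqrt(n) for n in lengths]   # worst-case-like rate
quarter_growth = [n ** 0.25 for n in lengths]          # real-data-like rate

print(loglog_slope(lengths, sqrt_growth))
print(loglog_slope(lengths, quarter_growth))
print(mean_radius([[3.0, 4.0], [3.0, 4.0]]))
```

On exact power laws the fitted slope recovers the exponent, so the two synthetic curves give slopes of about 0.5 and 0.25 up to floating-point error. On measured Lipschitz constants the fit would only be approximate.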

Theorems & Definitions (49)

  • Definition 2.1: Single-head self-attention
  • Definition 2.2: Multi-head self-attention
  • Definition 2.3: santambrogio2015optimal
  • Definition 2.4: Mean-field self-attention
  • Definition 2.5: Masked self-attention
  • Definition 2.6: Mean-field masked self-attention
  • Lemma 3.1: federer2014geometric
  • Lemma 3.2
  • Theorem 3.3
  • Proposition 3.4
  • ...and 39 more