Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features

Simone Bombari, Marco Mondelli

TL;DR

The paper asks why attention layers excel in NLP by formalizing word sensitivity (WS) in a random-feature setting. It shows that the WS of standard random features (RF) scales as $\mathcal{O}(1/\sqrt{n})$, where $n$ is the number of words in the context, while random attention features (RAF) have WS of constant order $\Omega(1)$, driven by the softmax in attention. The authors prove that the low WS of RF (and of deep random features, DRF) prevents them from distinguishing sentences that differ in a single word, whereas the high WS of RAF enables such generalization; experiments on BERT-Base embeddings of IMDb data complement the theory. The work identifies softmax-driven WS as a fundamental property distinguishing attention from fully connected layers, with implications for the design and analysis of transformer-based models.

Abstract

Understanding the reasons behind the exceptional success of transformers requires a better analysis of why attention layers are suitable for NLP tasks. In particular, such tasks require predictive models to capture contextual meaning, which often depends on one or a few words, even if the sentence is long. Our work studies this key property, dubbed word sensitivity (WS), in the prototypical setting of random features. We show that attention layers enjoy high WS, namely, there exists a vector in the space of embeddings that largely perturbs the random attention features map. The argument critically exploits the role of the softmax in the attention layer, highlighting its benefit compared to other activations (e.g., ReLU). In contrast, the WS of standard random features is of order $1/\sqrt{n}$, $n$ being the number of words in the textual sample, and thus it decays with the length of the context. We then translate these results on the word sensitivity into generalization bounds: due to their low WS, random features provably cannot learn to distinguish between two sentences that differ only in a single word; in contrast, due to their high WS, random attention features have higher generalization capabilities. We validate our theoretical results with experimental evidence over the BERT-Base word embeddings of the IMDb review dataset.

Paper Structure

This paper contains 33 sections, 15 theorems, 149 equations, 7 figures, 1 table.

Key Result

Theorem 1

Let $\varphi_{\textup{RF}}$ be the random features map defined in eq:rf, where $\phi$ is Lipschitz and not identically $0$. Let $X \in \mathbb{R}^{n \times d}$ be a generic input sample such that Assumption ass:d holds, and assume $k = \Omega(D)$. Let $\mathcal{S}_{\textup{RF}}(X)$ denote the word sensitivity [...] with probability at least $1 - \exp(-c D)$ over $V$.
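
As a rough numerical illustration of the $1/\sqrt{n}$ scaling of the RF word sensitivity stated in the abstract, the following sketch perturbs the first word of a synthetic sample and measures the relative change of a random features map. It is not the paper's experiment, and several ingredients are assumptions made only for this example: the form of the map (ReLU applied entrywise to $V$ times the flattened input, with i.i.d. Gaussian $V$ of variance $1/D$, $D = nd$), the normalization by $\|\varphi(X)\|_2$, the synthetic Gaussian rows of norm $\sqrt{d}$ in place of BERT-Base embeddings, and the use of a random $\Delta$ with $\|\Delta\|_2 = \sqrt{d}$ instead of an optimized one. A random $\Delta$ only probes the supremum from below. All function names are hypothetical.

```python
import math
import random


def rows_with_norm(n, d, rng, norm):
    """Return n rows of dimension d: Gaussian directions rescaled to `norm`."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scale = norm / math.sqrt(sum(c * c for c in v))
        rows.append([c * scale for c in v])
    return rows


def rf_map(X, V):
    """Assumed RF map: ReLU applied entrywise to V @ flatten(X)."""
    x = [c for row in X for c in row]
    return [max(sum(w * c for w, c in zip(v_row, x)), 0.0) for v_row in V]


def relative_change(n, d=8, k=200, seed=0):
    """Relative change of the RF map when word 1 is shifted by a random Delta."""
    rng = random.Random(seed)
    D = n * d
    X = rows_with_norm(n, d, rng, math.sqrt(d))
    # Assumption: V has i.i.d. N(0, 1/D) entries.
    V = [[rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)] for _ in range(k)]
    delta = rows_with_norm(1, d, rng, math.sqrt(d))[0]
    X_pert = [row[:] for row in X]
    X_pert[0] = [a + b for a, b in zip(X[0], delta)]  # perturb the first word only
    base = rf_map(X, V)
    pert = rf_map(X_pert, V)
    diff = math.sqrt(sum((p - b) ** 2 for p, b in zip(pert, base)))
    return diff / math.sqrt(sum(b * b for b in base))


for n in (2, 8, 32):
    print(n, relative_change(n))
```

Under these assumptions the printed ratio should shrink as $n$ grows, since a single perturbed word accounts for a vanishing fraction of the flattened input.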

Figures (7)

  • Figure 1: Left. Average attention scores for the word embeddings of the two sentences "I love her much" and "I love her smile". The embeddings are computed with the BERT-Base model, the scores are averaged over the 12 heads and displayed without the [CLS] token. Right. Output of the Llama2-7b-chat model for two prompts differing only in a single word.
  • Figure 2: Numerical estimate of the WS for the RF (left) and DRF (right) map, with a ReLU activation function. We estimate $\mathcal{S}_{\varphi}$ looking for the perturbation $\Delta^* = \arg \sup_{\left\|\Delta\right\|_2 \leq \sqrt{d}} \left\|\varphi(X^1(\Delta)) - \varphi(X)\right\|_2$, where we fix the first token for symmetry. We find $\Delta^*$ by optimizing our objective with constrained gradient ascent. For RF, we consider $d\in\{192, 384, 768\}$, as $n$ increases (taking the first $d$ dimensions of the embeddings of the first $n$ tokens). For DRF, we repeat the experiment for different depths $L\in\{2, 4, 8\}$ and fixed $d = 768$, as $n$ increases. As textual data $X$, we use the BERT-Base token embeddings of samples from the IMDb dataset, after a pre-processing to adapt the dimensions and fulfill Assumption \ref{ass:d}. We plot the average over 10 independent trials and the confidence band at 1 standard deviation. In the figure on the left, we intentionally dash the plotted lines to ease the visualization, as they overlap.
  • Figure 3: Numerical estimate of the WS for the RAF (first plot) and ReLU-RAF (second plot) map. The ReLU-RAF map is defined as the RAF one, but the $\mathrm{softmax}$ is replaced with a ReLU activation over the entries of $S(X)$, followed by a re-normalization to ensure that the attention scores sum up (on every row) to 1, as in the $\mathrm{softmax}$ case. We consider $d\in\{192, 384, 768\}$, as $n$ increases. The rest of the setup is equivalent to the one described in Figure \ref{fig:RF_S}. In the third plot, we present the relative change of the pre-trained BERT-Base model layer embeddings, evaluated on the abstract of this paper ($\sim$ 200 tokens), when the 42nd token is modified in embedding space with a vector $\Delta$. The $0$-th layer represents the input itself and, as a comparison, we report the results when the perturbation is chosen to be Gaussian noise with the same norm. In the fourth plot, we present the attention scores in the first head of the first layer of the BERT-Base model, evaluated on the title of this paper, when the 6th token is modified in embedding space with a vector $\Delta$. In the third and fourth plots, $\Delta$ is chosen to be a perturbation that attracts all the attention on the perturbed key token, which follows the proof idea of Theorem \ref{thm:RAF}.
  • Figure 4: Test error (as defined in \ref{eq:error} taking $i=1$) for the RF (left sub-plot), RAF (two central sub-plots) and ReLU-RAF (right sub-plot) maps, as a function of the smallest $\gamma$ such that Assumption \ref{ass:uncertainity} holds. The first (resp. second) row considers the fine-tuned solution $\theta^*_f$ (resp. re-trained solution $\theta^*_r$). Every sub-plot has a fixed embedding dimension $d = 768$, and context length $n\in \{40, 120\}$, taking the first $n$ token embeddings for each sample. Different colors correspond to a different number of training samples $N \in \{100, 700, 1300\}$. Every point in the scatter-plots is an independent simulation where $(X, y)$ and $(\mathcal{X}, \mathcal{Y})$ are the BERT-Base embeddings of a random subset of the IMDb dataset (after pre-processing to fulfill Assumption \ref{ass:d}). Circular markers correspond to obtaining $\Delta$ via gradient descent optimization of the losses in \ref{eq:lossf}; cross markers correspond to minimizing directly the test error in \ref{eq:error}.
  • Figure 5: $\textup{Err}_{\varphi}(X^i(\Delta), \theta^*_{f/r})$ for the RAF (two left sub-plots) and ReLU-RAF (two right sub-plots) maps, as a function of the smallest $\gamma$ for which Assumption \ref{ass:uncertainity} is satisfied. Every sub-plot has a fixed context length $n \in \{40, 120\}$, embedding dimension $d = 768$ and number of training samples $N = 400$. Every point in the scatter-plots represents an independent simulation where $(X, y)$ and $(\mathcal{X}, \mathcal{Y})$ are the BERT-Base embeddings of a random subset of the IMDb dataset (after pre-processing to fulfill Assumption \ref{ass:d}). For every point, $\Delta$ is obtained through constrained gradient descent optimization of $\ell_{\textup{Err}, p}(\Delta)$, defined in \ref{eq:penalized_loss}, for different values of the penalty term $p \in \{1, 0.1, 0.01\}$.
  • ...and 2 more figures
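
The Figure 3 caption contrasts softmax attention scores with ReLU scores that are re-normalized so that every row sums to 1. The sketch below compares the two normalizations on a toy score row in which a single key receives a large score, mimicking a perturbation that attracts the attention on one key token. The toy scores, the function names, and the uniform fallback for rows with no positive entry are assumptions introduced for this illustration, not taken from the paper.

```python
import math


def softmax_rows(S):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def relu_normalized_rows(S):
    """Row-wise ReLU followed by re-normalization so each row sums to 1."""
    out = []
    for row in S:
        pos = [max(s, 0.0) for s in row]
        total = sum(pos)
        if total == 0.0:
            # Assumption: fall back to uniform scores when no entry is positive.
            out.append([1.0 / len(row)] * len(row))
        else:
            out.append([p / total for p in pos])
    return out


def weight_on_first_key(t, n, normalize):
    """Attention weight on key 1 when its score is t and the other n-1 scores are 1."""
    row = [t] + [1.0] * (n - 1)
    return normalize([row])[0][0]


for n in (10, 100, 1000):
    print(n,
          weight_on_first_key(10.0, n, softmax_rows),
          weight_on_first_key(10.0, n, relu_normalized_rows))
```

In this toy setting the softmax weight on the boosted key depends exponentially on its score, while the ReLU-normalized weight is $t / (t + n - 1)$ and is diluted linearly by the number of other keys.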

Theorems & Definitions (33)

  • Theorem 1
  • Remark 4.1
  • Theorem 2
  • Theorem 3
  • Remark 5.1
  • Theorem 4
  • Theorem 5
  • Lemma B.1
  • proof
  • proof
  • ...and 23 more