Transformers Learn Low Sensitivity Functions: Investigations and Implications

Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Vatsal Sharan

TL;DR

The paper shows that Transformers exhibit a low-sensitivity bias to token-wise input perturbations across vision and language tasks, distinguishing them from CNNs, MLPs, and LSTMs. It extends the Boolean-sensitivity concept to real-valued data, defines a practical sensitivity measure, and provides theoretical support via a weak spectral bias in linear-attention kernels. Empirically, lower sensitivity correlates with improved robustness, flatter minima, and even provides a progress signal for grokking, with consistent results across synthetic data, ViT versus CNN comparisons, and RoBERTa versus LSTM analyses. These findings highlight a unified inductive bias of Transformers that has direct implications for robustness, generalization, and training dynamics, suggesting sensitivity as a practical tool for model analysis and design.

Abstract

Transformers achieve state-of-the-art accuracy and robustness across many tasks, but an understanding of their inductive biases and how those biases differ from other neural network architectures remains elusive. In this work, we identify the sensitivity of the model to token-wise random perturbations in the input as a unified metric which explains the inductive bias of transformers across different data modalities and distinguishes them from other architectures. We show that transformers have lower sensitivity than MLPs, CNNs, ConvMixers and LSTMs, across both vision and language tasks. We also show that this low-sensitivity bias has important implications: i) lower sensitivity correlates with improved robustness; it can also be used as an efficient intervention to further improve the robustness of transformers; ii) it corresponds to flatter minima in the loss landscape; and iii) it can serve as a progress measure for grokking. We support these findings with theoretical results showing (weak) spectral bias of transformers in the NTK regime, and improved robustness due to the lower sensitivity. The code is available at https://github.com/estija/sensitivity.

Paper Structure

This paper contains 57 sections, 7 theorems, 12 equations, 24 figures, and 4 tables.

Key Result

Proposition 3.1

Let $K$ be the CK or NTK of a transformer with linear attention on a Boolean cube $\text{\mancube}^{d}$. For any $\boldsymbol{x},\boldsymbol{y}\in\text{\mancube}^{d}$, we can write $K(\boldsymbol{x},\boldsymbol{y})=\Psi(\langle \boldsymbol{x}, \boldsymbol{y} \rangle)$ for some univariate function $\Psi$ [...] where $\mathbf{1} := (1,\dots,1) \in \text{\mancube}^{d}$, and the eigenvalues $\mu_k$, $k \in [d]$ [...]
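The structural claim $K(\boldsymbol{x},\boldsymbol{y})=\Psi(\langle \boldsymbol{x},\boldsymbol{y}\rangle)$ can be explored numerically. The sketch below is not taken from the paper: it assumes the Boolean cube is $\{-1,+1\}^d$ and uses a hypothetical $\Psi(t)=\exp(t/d)$ as a stand-in for the actual CK or NTK. Under those assumptions it checks the standard fact that every parity function $\chi_S(\boldsymbol{x})=\prod_{i\in S}x_i$ is an eigenfunction of such a kernel, with an eigenvalue that depends only on $|S|$.

```python
import itertools
import numpy as np

d = 4  # small dimension so the whole cube can be enumerated

# All 2^d points of the cube, ASSUMED here to be {-1, +1}^d.
cube = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

def psi(t):
    # Hypothetical univariate function, chosen only for illustration;
    # it is NOT the kernel function derived in the paper.
    return np.exp(t / d)

# Kernel matrix K[i, j] = psi(<x_i, x_j>).
K = psi(cube @ cube.T)

def parity(S):
    # chi_S(x) = prod_{i in S} x_i, evaluated at every cube point.
    if len(S) == 0:
        return np.ones(len(cube))
    return np.prod(cube[:, list(S)], axis=1)

def eigenvalue(S):
    # mu_S = E_x[chi_S(x) * psi(<x, 1>)] under the uniform distribution.
    return float(np.mean(parity(S) * psi(cube.sum(axis=1))))

for S in [(), (0,), (2,), (0, 1), (1, 3), (0, 1, 2)]:
    chi = parity(S)
    # Kernel operator applied to chi: (K chi)(x) = E_y[K(x, y) chi(y)].
    applied = K @ chi / len(cube)
    assert np.allclose(applied, eigenvalue(S) * chi)
    print(len(S), eigenvalue(S))
```

Subsets of equal size, such as `(0,)` and `(2,)`, print the same eigenvalue, so the spectrum is indexed by degree $k=|S|$. For this particular $\Psi$ the eigenvalues shrink as the degree grows; whether and how that ordering holds for the transformer kernels is what the proposition itself addresses.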

Figures (24)

  • Figure 1: Measuring Sensitivity in Vision Tasks. A patch is first selected and corrupted with Gaussian noise. The original image and the corrupted image are then fed into the same neural network to make predictions. If the two predictions are inconsistent, the neural network is sensitive to this patch. The process is repeated for every patch to measure the overall sensitivity.
  • Figure 2: Visualization of the synthetic data generation process (see \ref{sec:synth} for details). For simplicity, we represent each $d$-dimensional token with a square. Middle row: In each case, given a label $y$, we randomly sample $T=11$ tokens, with $n_s$ tokens from $\mathcal{V}_{\text{sparse}}^y$, $\left\lfloor (n_f+n_d)/2 \right\rfloor$ tokens from $\mathcal{V}_{\text{frequent}}^y$, $n_f\!-\!\left\lfloor (n_f+n_d)/2\right\rfloor$ tokens from $\mathcal{V}_{\text{frequent}}^{-y}$ and the remaining tokens from $\mathcal{V}_{\text{irrelevant}}$. Note that in the first example, since $n_s\!=\!3$ and $n_d\!=\!1$, a predictor that relies (only) on the sparse tokens is less sensitive compared to the one that relies on the frequent tokens. On the other hand, in the second example, since $n_s\!=\!1$ and $n_d\!=\!3$, the predictor that relies on the frequent tokens is less sensitive. Bottom row: We include two sentiment analysis-based examples to illustrate the synthetic data samples in the second row, using the same colors as the first two rows.
  • Figure 3: Comparison of sensitivity values for models that use only sparse or frequent tokens for the settings considered in \ref{fig:synth-joint2}.
  • Figure 4: Sensitivity on CIFAR-10. Comparison of the sensitivity of two CNNs, two ViTs, and ConvMixer trained on the CIFAR-10 dataset, as a function of training epochs. For a fair comparison, the figure also shows the train accuracies (see App. \ref{fig:cifar-full} for full train dynamics). All models have similar accuracies but the ViTs have significantly lower sensitivity.
  • Figure 5: Sensitivity over Datapoints Trained. On both datasets, the Transformer-based model RoBERTa displays much lower sensitivity compared to LSTMs during the entire training process. RoBERTa with ReLU activation has lower sensitivity compared to its Softmax counterpart at later stages of training.
  • ...and 19 more figures
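
The procedure in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (their code is in the linked repository): the function name `patch_sensitivity`, the `predict` callable, the square non-overlapping patches, and the noise scale `sigma` are all assumptions made for the example.

```python
import numpy as np

def patch_sensitivity(predict, images, patch_size, sigma, rng):
    """Fraction of (image, patch) pairs whose predicted label changes
    when Gaussian noise is added to that single patch.

    predict: hypothetical callable mapping a float array of shape
             (n, h, w, ...) to n integer labels.
    """
    n, h, w = images.shape[:3]
    clean = predict(images)
    flips, total = 0, 0
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            noisy = images.copy()
            # Basic slicing returns a view, so the in-place add below
            # corrupts only this patch of the copied batch.
            patch = noisy[:, top:top + patch_size, left:left + patch_size]
            patch += rng.normal(0.0, sigma, size=patch.shape)
            # Count predictions that disagree with the clean ones.
            flips += int(np.sum(predict(noisy) != clean))
            total += n
    return flips / total

# Toy usage: a "model" that looks only at the top-left pixel.
rng = np.random.default_rng(0)
images = np.zeros((5, 8, 8, 3))
corner_model = lambda x: (x[:, 0, 0, 0] > 0).astype(int)
print(patch_sensitivity(corner_model, images, patch_size=2, sigma=1.0, rng=rng))
```

The toy model can change its prediction only when the top-left patch is corrupted, so its score is at most 1/16 on an 8×8 image with 2×2 patches; a predictor that ignores its input scores exactly 0.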

Theorems & Definitions (11)

  • Proposition 3.1
  • Proposition 3.2
  • Definition 4.1
  • Definition 4.2: Synthetic Vocabulary
  • Definition 4.3: Dataset Generation
  • Definition 5.1: Tokenization for Vision Transformers
  • Theorem B.1: Theorem 3 in hron2020infinite
  • Theorem B.2: Theorem 3.2 in yang2020finegrained
  • Theorem B.3: Theorem 4.1 in yang2020finegrained
  • Theorem B.4: Theorem 2.49 in o'donnell_2014
  • ...and 1 more