Why are Sensitive Functions Hard for Transformers?

Michael Hahn, Mark Rofin

TL;DR

The paper shows that Transformers’ learnability biases are governed by input-space sensitivity, linking high-sensitivity functions like PARITY to sharp minima in the loss landscape. By formalizing average sensitivity $as_n(f)$ and the layer-norm blowup $N_i^{(k)}$, it proves lower bounds on sharpness for sensitive functions and demonstrates brittleness in parameter space. The work connects theory to experiments, showing that scratchpads reduce sensitivity, random initialization biases toward low-sensitivity regimes, and LN blowup is essential for learning parity-like tasks. These results shift the focus from purely expressiveness arguments to the geometry of the loss landscape to understand Transformer inductive biases and generalization.

Abstract

Empirical studies have identified a range of learnability biases and limitations of transformers, such as a persistent difficulty in learning to compute simple formal languages such as PARITY, and a bias towards low-degree functions. However, theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. We prove that, under the transformer architecture, the loss landscape is constrained by the input-space sensitivity: Transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space, leading to a low-sensitivity bias in generalization. We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers, such as their generalization bias towards low sensitivity and low degree, and difficulty in length generalization for PARITY. This shows that understanding transformers' inductive biases requires studying not just their in-principle expressivity, but also their loss landscape.
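
To make the notion of input-space sensitivity concrete, here is a minimal brute-force sketch. It assumes the standard Boolean-analysis definition of average sensitivity: the expected number of single-coordinate flips that change $f(x)$, with $x$ drawn uniformly from $\{\pm 1\}^n$. The function names (`average_sensitivity`, `parity`, `majority`) are illustrative and not taken from the paper.

```python
from itertools import product
from math import prod


def average_sensitivity(f, n):
    """Mean over all x in {-1,+1}^n of the number of coordinates i
    whose flip changes the value f(x). Brute force, so keep n small."""
    total = 0
    for x in product((-1, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            # Flip coordinate i and compare the function values.
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n


def parity(x):
    # Product of +/-1 entries: every single flip changes the output.
    return prod(x)


def majority(x):
    # Sign of the sum; unambiguous for odd n.
    return 1 if sum(x) > 0 else -1


n = 5
print(average_sensitivity(parity, n))    # 5.0
print(average_sensitivity(majority, n))  # 1.875
```

PARITY attains the maximum value $n$ because every flip changes its output, whereas MAJORITY changes only when the remaining bits are evenly split, so its average sensitivity grows much more slowly with $n$.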

Paper Structure

This paper contains 44 sections, 15 theorems, 104 equations, 15 figures, and 2 tables.

Key Result

Theorem 4

Consider a transformer with layer norm at arbitrary $\epsilon \geq 0$. With probability $1-\frac{H}{n^{2}}$ over the choice of $x \in \{\pm 1\}^n$, we have

Figures (15)

  • Figure 1: Training transformers on inputs of increasing length produces a steeper loss landscape for PARITY (as measured by average direction sharpness), while the loss landscape of MAJORITY does not show significant changes. Our main result (Theorem \ref{thm:lrho-bound}) provides a rigorous explanation for this phenomenon.
  • Figure 2: The tradeoff between parameter norm of Transformers trained to approximate PARITY and the blowup of their Layer Normalization layers. The tradeoff depends on the input length; blowup or parameter weights need to increase with the input length (in accordance with Corollary \ref{thm:bigtheorem}). This length dependency is not observed with low-sensitivity functions (Appendix, Figures \ref{exp:tradeoff-all-functions-comparison} and \ref{exp:tradeoff-4-lengths}).
  • Figure 3: During training a transformer on PARITY, a sudden drop in the loss coincides with an increase in sharpness. Sharpness decreases again in further training, but asymptotes to a nontrivial value (Appendix, Figure \ref{fig:dynamics-100k}). See corresponding curves for weight norm and LN Blowup in Appendix, Figure \ref{exp:dynamic-main}.
  • Figure 4: Generalization: When trained on data from a random Boolean function on a subset of $\{\pm 1\}^n$ (here: $n=10$), transformers generalize with reduced sensitivity compared to the actual function. The solutions found have lower sharpness than a solution fitting the actual function. When the training set is smaller, the inferred function is less constrained, and the learned functions have even lower sensitivity.
  • Figure 5: Sharpness as a function of sequence length for all the functions discussed in the paper. As predicted by Theorem \ref{thm:lrho-bound}, parameters fitting PARITY have substantial sharpness as inputs get longer. For functions with lower sensitivity, sharpness barely increases with the input length. For PARITY, the sharpness approaches the theoretical asymptotic lower bound of $1$ from Theorem \ref{thm:lrho-bound} already at $n \approx 30$. See Figure \ref{ex:sharpness-all-aligned} for a version with aligned y-axes.
  • ...and 10 more figures

Theorems & Definitions (33)

  • Definition 3
  • Theorem 4: Local Bounds on Layer Norm Blowup
  • Corollary 5: First Main Result
  • Theorem 6: Second Main Result
  • Theorem 7
  • Lemma 8: Folklore
  • Proof
  • Proof: Proof of Fact 1
  • Lemma 9: Sensitivity of an Attention Head
  • Remark 10
  • ...and 23 more