Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

Satwik Bhattamishra, Arkil Patel, Varun Kanade, Phil Blunsom

TL;DR

This work asks why Transformers generalize well despite their limited expressiveness on formal languages, and studies their inductive biases through the lens of Boolean function sensitivity. It shows that both randomly initialized and gradient-trained Transformers are biased toward low-sensitivity (simpler) functions, and that trained Transformers often converge to functions of even lower sensitivity than LSTMs. On sparse, low-sensitivity Boolean tasks, Transformers generalize near-perfectly even with noisy labels, whereas LSTMs tend to overfit, which suggests that a bias toward simpler functions may underlie Transformers' practical success. The discussion situates these findings in the context of analogies with AC0 circuits and acknowledges limitations in extending them to real-world data and longer sequences.

Abstract

Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions, which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite their relatively limited expressiveness.
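
To make the notion of sensitivity concrete, the following is a minimal sketch that assumes the standard definition: the sensitivity of $f$ at an input $x$ is the number of single-bit flips of $x$ that change $f(x)$, and the average sensitivity is its mean over all inputs. The helper names (`sensitivity_at`, `average_sensitivity`, `sparse_parity`, `noisy_dataset`) are illustrative and not taken from the paper, and the noisy sampler only mimics the label-noise setting described above; it is not the authors' data pipeline.

```python
import itertools
import random


def sensitivity_at(f, x):
    """Count the single-bit flips of x that change the output of f."""
    fx = f(x)
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != fx:
            count += 1
    return count


def average_sensitivity(f, n):
    """Mean sensitivity of f over all 2^n inputs (exhaustive, so keep n small)."""
    total = sum(sensitivity_at(f, x) for x in itertools.product((0, 1), repeat=n))
    return total / 2 ** n


def sparse_parity(indices):
    """Parity of the bits at the given positions; all other bits are ignored."""
    def f(x):
        return sum(x[i] for i in indices) % 2
    return f


def full_parity(x):
    """Parity of every bit: flipping any single bit changes the output."""
    return sum(x) % 2


def noisy_dataset(f, n, num_samples, noise_rate, rng):
    """Sample uniform inputs labeled by f, flipping each label with prob. noise_rate."""
    data = []
    for _ in range(num_samples):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        y = f(x)
        if rng.random() < noise_rate:
            y = 1 - y
        data.append((x, y))
    return data


if __name__ == "__main__":
    n = 10
    f = sparse_parity((0, 3, 5, 7))       # a 4-sparse parity on 10 bits
    print(average_sensitivity(f, n))       # 4.0: only the 4 relevant bits matter
    print(average_sensitivity(full_parity, n))  # 10.0: every bit matters
    sample = noisy_dataset(f, n, 5, 0.1, random.Random(0))
    print(len(sample))
```

A $k$-sparse parity has sensitivity exactly $k$ at every input, independent of the input length $n$, which is why sparse functions count as low-sensitivity; full parity sits at the other extreme with sensitivity $n$.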

Paper Structure (27 sections, 1 theorem, 6 equations, 23 figures, 1 table)

Key Result

Proposition D.1

For any $\delta >0$, with probability at least $1-\delta$, the following holds for any function $f, \hat{f} \in \mathcal{F}_k$,

Figures (23)

  • Figure 1: Overview of the main findings of the paper. Training and validation curves for Transformers (top left) and LSTMs (bottom left) on 10 different $k$-sparse functions in the presence of 10% noise. Sensitivity of Transformers and LSTMs during training on random Boolean functions (top right). Distribution of sensitivity of functions represented by random models with weights initialized according to a normal distribution ($\mu = 0$, $\sigma = 10$) (bottom right).
  • Figure 2: Distribution of sensitivity of different randomly initialized Transformers and LSTMs. Top row: Transformers (left) and LSTMs (right) with uniformly sampled weights across various hyperparameters. Bottom row: Analogous to top row but with Xavier normal initialization. Refer to Section \ref{subsec:randsensi_exp} for details.
  • Figure 3: Distribution of sensitivity of randomly initialized Transformers and LSTMs for a fixed hyperparameter (layers=2, width=256).
  • Figure 4: Distribution of sensitivity of Transformers and LSTMs trained on Boolean strings with random labels. Refer to Section \ref{subsec:randbool_exp} for details.
  • Figure 5: Training and validation curves for Transformers and LSTMs trained on Sparse Parities of length $n=40$ and $k=4$.
  • ...and 18 more figures

Theorems & Definitions (1)

  • Proposition D.1