Trapped by simplicity: When Transformers fail to learn from noisy features

Evan Peters; Ando Deng; Matheus H. Zambianco; Devin Blankespoor; Achim Kempf

TL;DR

The paper investigates whether transformers trained on data with feature noise can learn a target Boolean function $f$ and generalize to noiseless inputs. Using a bitflip noise model, it analyzes the Bayes-optimal predictor on noisy data, $f_N^* = \mathrm{sign}(T_{1-2p} f)$, and studies learning of $\mathsf{parity}$, $\mathsf{maj}$, and random $k$-juntas, comparing transformers to LSTMs. Transformers succeed on sparse parity and odd-majority tasks under noise but falter on random $k$-juntas. The authors attribute these failures to a simplicity bias that steers learning toward low-sensitivity predictors whenever the noisy-optimal predictor $f_N^*$ is simpler than the target $f$. They construct a trap scenario that exploits this bias and show that adding a sensitivity penalty to the loss mitigates the trap in some cases. Overall, the work identifies limits of current transformer inductive biases for learning complex Boolean relations from noisy data, and it motivates regularization or data-design strategies that counteract simplicity bias in algorithmic and discrete tasks, and potentially in natural language with stochastic inputs.
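The noise-robust learning setup summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' experimental code: it assumes uniformly random $\pm 1$ inputs, independent bitflips at rate `p`, and that $\mathsf{parity}(20, 4)$ denotes a parity over 4 of 20 bits. The function names, the choice of relevant positions, the seed, and the dataset size are all hypothetical.

```python
import random

def parity(x, relevant):
    """Parity of the bits at positions `relevant`, with bits encoded as +1/-1."""
    out = 1
    for i in relevant:
        out *= x[i]
    return out

def sample_noisy_dataset(f, n, p, num_examples, rng):
    """Return (noisy_inputs, clean_inputs, labels) under bitflip feature noise.

    Each clean input x is uniform over {-1, +1}^n and labeled by f(x).
    The noisy input z flips every bit of x independently with probability p.
    """
    noisy, clean, labels = [], [], []
    for _ in range(num_examples):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        z = [-b if rng.random() < p else b for b in x]
        clean.append(x)
        noisy.append(z)
        labels.append(f(x))  # labels come from the clean input, not the noisy one
    return noisy, clean, labels

# Illustrative parameters (assumptions, not taken from the paper's configs).
rng = random.Random(0)
target = lambda x: parity(x, relevant=(0, 1, 2, 3))
Z, X, y = sample_noisy_dataset(target, n=20, p=0.1, num_examples=2000, rng=rng)

# Empirical fraction of flipped bits, which should be close to p.
flip_rate = sum(zi != xi for z, x in zip(Z, X) for zi, xi in zip(z, x)) / (20 * 2000)
```

In this sketch a learner would be trained on the pairs `(Z, y)`, while noise-robust learning is judged by how well it predicts `y` from the clean inputs `X`.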

Abstract

Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, compared to LSTMs which fail at this task for even modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure is due to a combination of two factors: transformers' bias toward simpler functions, combined with an observation that the optimal function for noise-robust learning typically has lower sensitivity than the target function for random boolean functions. We test this hypothesis by exploiting transformers' simplicity bias to trap them in an incorrect solution, but show that transformers can escape this trap by training with an additional loss term penalizing high-sensitivity solutions. Overall, we find that transformers are particularly ineffective for learning boolean functions in the presence of feature noise.

Paper Structure (20 sections, 7 theorems, 47 equations, 7 figures, 4 tables)

Introduction
Prior work
Background
Transformers succeed at noise-robust learning of sparse parities and odd majorities
Noise robustness versus simplicity bias
Trapping transformers with simplicity bias
Discussion
Conclusion
Background
Boolean analysis
Information theory
Optimal next-bit prediction
Proof of Proposition \ref{prop:new}
Discussion of other noise models
Experimental methods
...and 5 more sections

Key Result

Proposition 1

For each function $f \in \{\mathsf{maj}_n, \mathsf{parity}\}$ (with $n$ odd), $f$ is optimal for prediction on data with noisy features, i.e. $f = f_N^*$.
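For very small $n$, the statement can be checked by brute force. The sketch below assumes the bitflip noise model with rate $p < 1/2$ and uses $f_N^* = \mathrm{sign}(T_{1-2p} f)$ from the TL;DR; the helper names, the choice $n = 3$, $p = 0.3$, and the tie-breaking rule for $\mathrm{sign}(0)$ are assumptions made for illustration. It also includes a hypothetical 3-bit AND function as a contrasting case in which $f \neq f_N^*$.

```python
from itertools import product

def noise_operator(f, x, p):
    """Exact (T_{1-2p} f)(x): the average of f over all inputs reachable
    from x by independent bitflips, each occurring with probability p."""
    total = 0.0
    for flips in product((False, True), repeat=len(x)):
        weight = 1.0
        for flipped in flips:
            weight *= p if flipped else 1.0 - p
        y = tuple(-b if flipped else b for b, flipped in zip(x, flips))
        total += weight * f(y)
    return total

def bayes_optimal(f, p):
    """Return f_N^* = sign(T_{1-2p} f). Ties (value 0) are sent to +1 here,
    which is an arbitrary choice for this sketch."""
    return lambda x: 1 if noise_operator(f, x, p) >= 0 else -1

def parity(x):
    out = 1
    for b in x:
        out *= b
    return out

def majority(x):
    return 1 if sum(x) > 0 else -1  # no ties for odd length

def all_ones(x):
    return 1 if all(b == 1 for b in x) else -1  # heavily biased AND-like function

def agrees_everywhere(f, g, n):
    return all(f(x) == g(x) for x in product((-1, 1), repeat=n))

n, p = 3, 0.3
parity_is_optimal = agrees_everywhere(parity, bayes_optimal(parity, p), n)
majority_is_optimal = agrees_everywhere(majority, bayes_optimal(majority, p), n)
and_is_optimal = agrees_everywhere(all_ones, bayes_optimal(all_ones, p), n)
```

For these parameters the parity and majority checks hold, while the biased AND-like function does not coincide with its noisy-optimal predictor, which collapses to the constant $-1$ function. The enumeration costs $4^n$ evaluations, so it is only practical for a handful of bits.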

Figures (7)

Figure 1: Transformers learn $\mathsf{maj}_n$ (with odd $n$) and $\mathsf{parity}$ robustly from noisy features. (a-c) For $\mathsf{maj}(20, 5)$ and $\mathsf{maj}(40, 5)$, the median transformer (SAN) reliably outperforms the best LSTM across 300 training runs with a variety of hyperparameters tuned to optimize each architecture's success probability. Validation accuracy approximates $\mathrm{err}_f(\hat{f})$ using 10000 examples, where $\hat{f}$ is either an LSTM or SAN prediction rule. Each point on the solid lines represents the best LSTM or the median SAN from 300 training experiments. (d) Both LSTMs and SANs fail in a large fraction of training experiments when learning $\mathsf{parity}(20, 4)$ with feature noise, but transformers successfully learn $\mathsf{parity}(20, 4)$ (defined as achieving noiseless accuracy $\geq 95\%$) more often than LSTMs, even when both architectures perform comparably at zero noise rate. See Appendix \ref{app:experiments} for experiment details.
Figure 2: Transformers generally fail at noise-robust learning for random $k$-juntas, and perform worse as the gaps in sensitivity and validation error between $f$ and $f_N^*$ grow. (a) Each point represents a randomly sampled $k$-junta $f$ (3200 total). Every randomly sampled $f$ obeys $\mathrm{I}[f] \geq \mathrm{I}[f_N^*]$, while by definition $\mathrm{err}_f(f) \geq \mathrm{err}_f(f_N^*)$. Minimizing validation error in noise-robust learning will only succeed for functions near the bottom of the plot, while a training algorithm with a low-sensitivity bias will only succeed for points near the left of the plot. By Prop. \ref{prop:sens_bias}, $\mathsf{maj}_n$ (odd $n$) and $\mathsf{parity}$ are represented by the coordinate $(0, 0)$. (b-c) Transformers only succeed at noise-robust learning when $\mathrm{I}[f] \approx \mathrm{I}[f_N^*]$ and $\mathrm{err}_f(f) \approx \mathrm{err}_f(f_N^*)$ (across 3200 learning experiments). Histograms of models' final validation error, train error, and optimal error demonstrate that noise-robust learning fails despite (d) near-optimal performance with (e) little overfitting. See Appendix \ref{app:experiments} for additional experimental details.
Figure 3: LSTMs and transformers fail to learn $f$ from training data with feature noise in distinct ways. (a-b) We consider a particular trap function $f$ such that $\mathrm{err}_f(f_N^*) \approx \mathrm{err}_f(f)$, while $\mathrm{I}[f_N^*] \ll \mathrm{I}[f]$. Blue and red lines show (smoothed) training dynamics of transformers and LSTMs trained on noisy inputs across a variety of hyperparameters and initializations. Each point represents a (learned) boolean function. Transformers approach optimal validation accuracy while RNNs perform no better than memorization of training data, and both models fail to learn $f$ (★). However, an explicit sensitivity penalty $\lambda \mathrm{I}[\hat{f}]$ ($\lambda = 1$) in the loss function allows transformers to learn $f$ (green lines). (c) There is a clear optimum $\lambda$ for learning $f$ from noisy data using a sensitivity penalty in the loss. (d-f) This behavior does not extend to functions where $\mathrm{err}_f(f_N^*) \ll \mathrm{err}_f(f)$, for example $\mathsf{maj}(30, 4)$ with $p=0.32$, for which $f_N^*$ is a heavily biased function. (g) Overall, transformers do not outperform LSTMs at learning $\mathsf{maj}(n, k)$ with even $n$ (shown: $\mathsf{maj}(30, 4)$) with feature noise. See Appendix \ref{app:exp3} for additional details.
Figure 4: Validation accuracy for LSTMs (top row) and transformers (bottom row) for each sparse $\mathsf{maj}$ noise-robust learning task, colored according to bitflip rate. Each experiment with a particular error rate consists of 300 independent training runs with random initialization (maximum validation accuracy over each training trial is shown).
Figure 5: Validation accuracy for LSTMs (left) and transformers (right) for the $\mathsf{parity}(20, 4)$ noise-robust learning task (other details same as Fig. \ref{fig:app_maj}).
...and 2 more figures
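The captions for Figures 2 and 3 rely on the sensitivity $\mathrm{I}[\cdot]$ and on a loss with an added term $\lambda \mathrm{I}[\hat{f}]$. The following is a rough sketch of both quantities for tiny input sizes, assuming $\mathrm{I}[g]$ is the total influence (average number of single-bit flips that change the output). The squared-error fit term and the smooth surrogate `soft_sensitivity` are assumptions chosen for illustration, not necessarily the loss used by the authors; all function names are hypothetical.

```python
from itertools import product

def flip(x, i):
    """Return a copy of the tuple x with bit i negated."""
    return x[:i] + (-x[i],) + x[i + 1:]

def total_influence(g, n):
    """I[g] = sum_i Pr_x[g(x) != g(x with bit i flipped)], x uniform on {-1,+1}^n."""
    inputs = list(product((-1, 1), repeat=n))
    disagreements = sum(g(x) != g(flip(x, i)) for x in inputs for i in range(n))
    return disagreements / len(inputs)

def soft_sensitivity(h, inputs):
    """Surrogate for real-valued h with outputs in [-1, 1]: the mean over x of
    sum_i ((h(x) - h(flip(x, i))) / 2)^2. For +1/-1 valued h evaluated on the
    full cube this equals total_influence(h, n)."""
    n = len(inputs[0])
    total = sum(((h(x) - h(flip(x, i))) / 2) ** 2 for x in inputs for i in range(n))
    return total / len(inputs)

def penalized_loss(h, inputs, labels, lam):
    """Mean squared error plus lam times the sensitivity surrogate (assumed form)."""
    fit = sum((h(x) - y) ** 2 for x, y in zip(inputs, labels)) / len(inputs)
    return fit + lam * soft_sensitivity(h, inputs)

def parity(x):
    out = 1
    for b in x:
        out *= b
    return out

def majority(x):
    return 1 if sum(x) > 0 else -1

cube = list(product((-1, 1), repeat=3))
influence_parity = total_influence(parity, 3)       # every bit is always pivotal
influence_majority = total_influence(majority, 3)   # a bit is pivotal when the other two disagree
loss_at_target = penalized_loss(parity, cube, [parity(x) for x in cube], lam=1.0)
```

On three bits, parity has total influence 3 and majority has total influence 1.5, so a penalty of this form costs a perfect parity predictor 3 units at $\lambda = 1$ even though its fit term is zero. In practice the average over inputs would be estimated from a sampled batch rather than the full cube.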

Theorems & Definitions (11)

Proposition 1
Proposition 2
Theorem 3
Lemma 4
proof
Lemma 5
proof
Proposition 6: $\mathop{\mathrm{\textsc{parity}}}\nolimits$ is optimal for predicting noisy $\mathop{\mathrm{\textsc{parity}}}\nolimits$
proof
Proposition 7
...and 1 more
