Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization

Pál Zsámboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang

TL;DR

This work defines the Set Complement Task to study length generalization in single-layer, attention-only transformers and proves tight embedding and value-dimension bounds. It shows that solving the task at lengths 1 and 2 with balanced logit displacements guarantees generalization to longer lengths, albeit with precision that decays roughly as $\frac{2}{s}$ with length $s$. The authors identify attention dispersion as a core mechanism limiting long-sequence performance, and noisy updates, which arise when many next tokens are possible, as a second obstacle. They propose dropout to counteract the former and Bias-Corrected EMA (BEMA) to counteract the latter, validating both strategies via random hyperparameter searches. They further demonstrate that BEMA improves length generalization in a more complex setting with OthelloGPT, suggesting practical utility for real-world sequence modeling where multiple next moves are plausible. Overall, the paper provides a principled framework linking model capacity, training dynamics, and stabilization techniques to length generalization in algorithmic tasks.
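The attention-dispersion effect described above can be illustrated with a toy calculation. This is a minimal sketch under assumptions that are not taken from the paper: attention is uniform over the input tokens, each attended token pushes its own output logit down by a fixed hypothetical amount `delta`, and the names `invalid_mass`, `vocab_size`, and `delta` are illustrative. Averaging over more tokens shrinks each displacement, so more probability mass leaks onto tokens that are already in the input.

```python
import math


def invalid_mass(seq_len, vocab_size=16, delta=10.0):
    """Probability mass on tokens already in the input, under a toy model.

    Assumptions (hypothetical, for illustration only): attention is uniform
    over seq_len distinct input tokens, and each attended token lowers its
    own output logit by delta before the attention average is taken.
    """
    # Uniform attention averages the per-token displacements, so every
    # present token ends up with logit -delta / seq_len; absent tokens get 0.
    present_logit = -delta / seq_len
    present_weight = seq_len * math.exp(present_logit)
    absent_weight = float(vocab_size - seq_len)  # exp(0) per absent token
    return present_weight / (present_weight + absent_weight)


if __name__ == "__main__":
    for s in (1, 2, 4, 8, 12):
        print(f"length {s:2d}: mass on invalid tokens = {invalid_mass(s):.4f}")
```

In this toy setting the leaked mass grows monotonically with length, which matches the qualitative picture of eroding separation between valid and invalid outputs. It does not reproduce the paper's quantitative bound.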

Abstract

We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence -- an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
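As a concrete reading of the task definition above, the target for an input sequence is the uniform distribution over vocabulary tokens that do not appear in it. The sketch below is illustrative: the function names, the integer token encoding `0..vocab_size-1`, and the use of total variation distance as the comparison metric (assumed to be what "TVD" in the figure captions abbreviates) are assumptions, not the paper's code.

```python
def complement_target(sequence, vocab_size):
    """Uniform distribution over tokens in range(vocab_size) absent from sequence."""
    present = set(sequence)
    num_absent = vocab_size - len(present)
    if num_absent <= 0:
        raise ValueError("sequence covers the whole vocabulary; no valid target")
    p = 1.0 / num_absent
    return [0.0 if token in present else p for token in range(vocab_size)]


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    target = complement_target([0, 2], vocab_size=4)
    print(target)  # tokens 1 and 3 each receive probability 0.5
    uniform = [0.25] * 4
    print(total_variation_distance(target, uniform))
```

A model that ignored its input and predicted the uniform distribution over all four tokens would sit at distance 0.5 from this target.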

Paper Structure

This paper contains 21 sections, 2 theorems, 28 equations, 5 figures, 2 tables.

Key Result

Theorem 4.2

Assume that the model $f_\theta$ has constant attention. Then the following statements hold: (a) Suppose that the model $f_\theta$ has precision $C>0$ at length 1. Then the matrix $\mathbf B + \mathbf D$ has rank at least $v-1$. In particular, we have $d\ge v-1$. (b) Suppose moreover that the model [...]. Then for each $3\le s<v$, the model $f_\theta$ has precision $\frac{2}{s}C$ at length $s$.
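To give a feel for how quickly the guarantee in part (b) weakens, the bound can be evaluated at a few lengths. The lengths are illustrative choices, and each requires a vocabulary with $v > s$:

$$
\left.\frac{2}{s}C\right|_{s=3} = \frac{2}{3}C,\qquad
\left.\frac{2}{s}C\right|_{s=4} = \frac{1}{2}C,\qquad
\left.\frac{2}{s}C\right|_{s=10} = \frac{1}{5}C,\qquad
\left.\frac{2}{s}C\right|_{s=20} = \frac{1}{10}C.
$$

Doubling the length therefore halves the guaranteed precision, since $\frac{2}{2s}C = \frac{1}{2}\cdot\frac{2}{s}C$.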

Figures (5)

  • Figure 1: Summary of best metrics per model. We do not display ITR values as they are below $5\cdot10^{-4}$ in all cases.
  • Figure 2: Mean dropout rates of top portions of models per generalization gap and extra dimensions.
  • Figure 3: Means of top quantiles of BEMA model metrics and no BEMA metrics in OthelloGPT length generalization.
  • Figure 4: Metrics per BEMA power--EMA power pairs in OthelloGPT length generalization.
  • Figure 5: Generalization gap to TVD without BEMA in OthelloGPT, colored by embedding dropout rate.

Theorems & Definitions (5)

  • Example 4.1
  • Theorem 4.2
  • Lemma 4.3
  • proof
  • proof: Proof of Theorem 4.2