How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad

Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, Omid Saremi

TL;DR

This paper puts forward the notion of 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers.

Abstract

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers. This measure shows a contrast with the expressivity results of Transformers captured by $TC^0/TC^1$ classes (further studied here), since the globality relates to correlations with the more limited $NC^0$ class. We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of 'inductive scratchpad', that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization. In particular, some of our inductive scratchpads can achieve length generalizations of up to $6\times$ for some arithmetic tasks depending on the input formatting.

Paper Structure (55 sections, 9 theorems, 28 equations, 11 figures, 3 tables)

This paper contains 55 sections, 9 theorems, 28 equations, 11 figures, 3 tables.

Key Result

Lemma 1

We have $\mathrm{glob}(\text{Cycle task}(n))\ge n$.

Figures (11)

  • Figure 1: Illustration of the cycle task for $n=4$ (left) and the complexity to learn it (right).
  • Figure 2: The cycle task variant used in Theorem \ref{3cycleTheorem}: the above example is stored as a_0>b_1;b_0>c_1;c_0>a_1;a_1>a_2;b_1>c_2;c_1>b_2;a_2>b_3;b_2>c_3;c_2>a_3;a_3>b_0;b_3>a_0;c_3>c_0;a_0?b_0?c_0
  • Figure 3: An illustration showing how scratchpads can break the globality. The target may be efficiently learned if each scratchpad step is of low globality given the previous ones.
  • Figure 4: (Left) Learning the cycle task with a scratchpad. (Right) OOD generalization for the DFS and inductive scratchpads (see Section \ref{sec:full-fails}).
  • Figure 5: Length generalization for parity and addition tasks using different random seeds. The medians of the results are highlighted in bold.
  • ...and 6 more figures
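
To make the input format quoted in the Figure 2 caption concrete, the following minimal Python sketch parses that string and follows its edges. It rests on two assumptions that the caption does not state: each `u>v` item is read as a directed edge from `u` to `v`, and the trailing `a_0?b_0?c_0` is read as a query asking whether the listed vertices lie on the same cycle. The helper names `parse_instance` and `cycle_of` are hypothetical and do not come from the paper.

```python
def parse_instance(text):
    """Split 'u>v;...;q1?q2?...' into a successor map and the query vertices."""
    *edge_parts, query_part = text.split(";")
    successor = {}
    for part in edge_parts:
        src, dst = part.split(">")
        successor[src] = dst  # assumption: 'u>v' is a directed edge u -> v
    return successor, query_part.split("?")


def cycle_of(successor, start):
    """Follow successor pointers from `start` until the walk returns to it."""
    visited = [start]
    node = successor[start]
    while node != start:
        visited.append(node)
        node = successor[node]
    return visited


# The stored example from the Figure 2 caption, copied verbatim.
example = (
    "a_0>b_1;b_0>c_1;c_0>a_1;a_1>a_2;b_1>c_2;c_1>b_2;"
    "a_2>b_3;b_2>c_3;c_2>a_3;a_3>b_0;b_3>a_0;c_3>c_0;a_0?b_0?c_0"
)

successor, queries = parse_instance(example)
cycle = cycle_of(successor, queries[0])

# Assumed reading of the query: are all listed vertices on one cycle?
same_cycle = all(q in cycle for q in queries)
print(len(cycle), same_cycle)  # prints: 12 True
```

Under this reading, the walk from `a_0` visits all 12 vertices before returning, so the three queried vertices share a single cycle. Answering the query this way requires following the whole chain of edges rather than inspecting a few tokens.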

Theorems & Definitions (28)

  • Definition 1: Cycle task
  • Definition 2
  • Remark 1
  • Definition 3
  • Lemma 1
  • Definition 4
  • Remark 2
  • Conjecture 1
  • Remark 3
  • Theorem 1
  • ...and 18 more