How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad

Emmanuel Abbe; Samy Bengio; Aryo Lotfi; Colin Sandon; Omid Saremi

TL;DR

The paper introduces the 'globality degree' of a target distribution, a measure that captures when weak learning is efficiently achievable by regular Transformers.

Abstract

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers. This measure shows a contrast with the expressivity results of Transformers captured by $TC^0/TC^1$ classes (further studied here), since the globality relates to correlations with the more limited $NC^0$ class. We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of 'inductive scratchpad', that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization. In particular, some of our inductive scratchpads can achieve length generalizations of up to $6\times$ for some arithmetic tasks depending on the input formatting.
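To make the idea of intermediate steps concrete, the following is a minimal sketch of a scratchpad for the parity task, which the paper lists among its experiments. It is an illustration only, not the authors' scratchpad format: the function name `parity_scratchpad` and the list-of-bits encoding are assumptions made for this example. The point it shows is that each scratchpad entry depends on just the previous entry and one new input bit, whereas the final parity on its own depends on every input bit.

```python
def parity_scratchpad(bits):
    """Return the running parities of `bits` (hypothetical helper).

    Entry i is the parity of bits[0..i], so the last entry is the
    parity of the whole input. Each entry is computed from the
    previous entry and a single input bit.
    """
    running = 0
    steps = []
    for bit in bits:
        running ^= bit          # local update: previous step XOR one bit
        steps.append(running)
    return steps


bits = [1, 0, 1, 1]
steps = parity_scratchpad(bits)
print(steps)        # [1, 1, 0, 1]
print(steps[-1])    # 1, the parity of the full input
```

A model trained to emit `steps` before the answer only ever has to learn the local update, which is the sense in which intermediate steps can lower the difficulty of each prediction.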

Paper Structure (55 sections, 9 theorems, 28 equations, 11 figures, 3 tables)

Introduction
Syllogisms composition
Hardness of long compositions
Hardness of global reasoning
Our contributions
Results on the global reasoning barrier
Defining globality and auto-regressive globality
Attributes of $\mathrm{glob}$ and some examples
Transformers require low globality: formal results
Agnostic scratchpads cannot break the globality
Scratchpads to break the globality
Educated scratchpad
Results for learning parities
Results for the cycle task
Inductive Scratchpads
...and 40 more sections

Key Result

Lemma 1

We have $\mathrm{glob}(\text{Cycle task}(n))\ge n$.

Figures (11)

Figure 1: Illustration of the cycle task for $n=4$ (left) and the complexity to learn it (right).
Figure 2: The cycle task variant used in Theorem \ref{3cycleTheorem}. The above example is stored as a_0>b_1;b_0>c_1;c_0>a_1;a_1>a_2;b_1>c_2;c_1>b_2;a_2>b_3;b_2>c_3;c_2>a_3;a_3>b_0;b_3>a_0;c_3>c_0;a_0?b_0?c_0
Figure 3: An illustration showing how scratchpads can break the globality. The target may be efficiently learned if each scratchpad step is of low globality given the previous ones.
Figure 4: (Left) Learning the cycle task with a scratchpad. (Right) OOD generalization for the DFS and inductive scratchpads (see Section \ref{sec:full-fails}).
Figure 5: Length generalization for parity and addition tasks using different random seeds. The medians of the results are highlighted in bold.
...and 6 more figures
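
The stored string in the Figure 2 caption can be read mechanically. The sketch below parses it and follows the edges starting from a_0. It assumes, based only on the caption text, that `u>v` denotes a directed edge from u to v, that `;` separates entries, and that the final `?`-separated entry lists the query nodes; the helper names `parse_edges` and `cycle_from` are hypothetical. What the query formally asks is defined in the paper, not here.

```python
FIGURE_2 = (
    "a_0>b_1;b_0>c_1;c_0>a_1;a_1>a_2;b_1>c_2;c_1>b_2;"
    "a_2>b_3;b_2>c_3;c_2>a_3;a_3>b_0;b_3>a_0;c_3>c_0;a_0?b_0?c_0"
)


def parse_edges(encoded):
    """Split 'u>v;...;q1?q2?...' into a successor map and the query nodes."""
    *edge_parts, query = encoded.split(";")
    successor = {}
    for part in edge_parts:
        src, dst = part.split(">")
        successor[src] = dst
    return successor, query.split("?")


def cycle_from(successor, start):
    """Follow edges from `start` until returning to it; return the visit order."""
    order = [start]
    node = successor[start]
    # Guard against malformed input: a cycle cannot be longer than the edge count.
    while node != start and len(order) <= len(successor):
        order.append(node)
        node = successor[node]
    return order


successor, query_nodes = parse_edges(FIGURE_2)
cycle = cycle_from(successor, "a_0")
positions = {node: cycle.index(node) for node in query_nodes}

print(len(cycle))   # 12: all twelve nodes lie on one cycle
print(positions)    # {'a_0': 0, 'b_0': 4, 'c_0': 8}
```

In this example, reaching b_0 from a_0 takes four edge lookups and reaching c_0 takes eight, so no small, fixed subset of the edges settles how the query nodes are connected. This is only an informal illustration of why such tasks are described as global, not a restatement of the paper's formal definition.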

Theorems & Definitions (28)

Definition 1: Cycle task
Definition 2
Remark 1
Definition 3
Lemma 1
Definition 4
Remark 2
Conjecture 1
Remark 3
Theorem 1
...and 18 more
