Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks

Rahul Ramesh; Ekdeep Singh Lubana; Mikail Khona; Robert P. Dick; Hidenori Tanaka

TL;DR

This study demonstrates that autoregressive Transformers trained on a synthetic, well-defined data-generating process can learn to compose a large set of predefined capabilities and generalize to exponentially or combinatorially many unseen functions. Step-by-step prompting, which exposes intermediate results, significantly enhances compositional generalization, while direct prompting often fails unless data diversity is extremely high. The authors formalize capabilities and compositions, differentiate in-order and out-of-order generalization, and provide mechanistic insights showing how attention selects task tokens and how MLP layers implement the composed functions. They further analyze training dynamics and attention behavior, offering theoretical constructions and empirical evidence that support a mechanistic view of compositionality in Transformers. Overall, the work suggests that composition in neural models can emerge under controlled synthetic conditions and that inference protocols enabling recursion through intermediate outputs can unlock rich, previously unseen capabilities with implications for probing model competencies.

Abstract

Transformers trained on huge text corpora exhibit a remarkable set of capabilities, e.g., performing basic arithmetic. Given the inherent compositional nature of language, one can expect the model to learn to compose these capabilities, potentially yielding a combinatorial explosion of what operations it can perform on an input. Motivated by the above, we train autoregressive Transformer models on a synthetic data-generating process that involves compositions of a set of well-defined monolithic capabilities. Through a series of extensive and systematic experiments on this data-generating process, we show that: (1) autoregressive Transformers can learn compositional structures from small amounts of training data and generalize to exponentially or even combinatorially many functions; (2) generating intermediate outputs when composing functions is more effective for generalizing to new, unseen compositions than not generating any intermediate outputs; (3) biases in the order of the compositions in the training data result in Transformers that fail to compose some combinations of functions; and (4) the attention layers select which capability to apply while the feed-forward layers execute the selected capability.
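To make finding (2) concrete, the following is a minimal sketch of the two completion formats: step-by-step (every intermediate output is emitted) and direct (only the final result is emitted). It assumes, purely for illustration, that each capability is a random bijection over a small token vocabulary applied token-wise; the names `VOCAB`, `make_bijection`, `step_by_step`, and `direct` are hypothetical and are not taken from the paper.

```python
import random

VOCAB = list(range(10))  # hypothetical data-token vocabulary (assumption)


def make_bijection(rng):
    """Return a random bijection VOCAB -> VOCAB as a lookup table."""
    perm = VOCAB[:]
    rng.shuffle(perm)
    return dict(zip(VOCAB, perm))


def apply_fn(fn, tokens):
    """Apply a bijection token-wise to a sequence of data tokens."""
    return [fn[t] for t in tokens]


def step_by_step(fns, x):
    """Emit every intermediate output: f1(x), f2(f1(x)), ..."""
    outputs, cur = [], x
    for fn in fns:
        cur = apply_fn(fn, cur)
        outputs.append(cur)
    return outputs


def direct(fns, x):
    """Emit only the final result of the composition."""
    cur = x
    for fn in fns:
        cur = apply_fn(fn, cur)
    return cur


rng = random.Random(0)
fns = [make_bijection(rng) for _ in range(3)]  # three composed capabilities
x = [3, 1, 4, 1, 5, 9]                         # data tokens

steps = step_by_step(fns, x)  # three intermediate sequences
final = direct(fns, x)        # one final sequence
```

In this sketch the last step-by-step output always equals the direct output; the formats differ only in whether the intermediate results appear in the sequence the model has to generate.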

Paper Structure (50 sections, 2 theorems, 36 equations, 22 figures, 2 tables)

Introduction
Related Work
Capabilities in a Transformer.
Compositionality in neural networks.
Formalizing capabilities and compositions
Experimental Setup and Data-Generating process
Results
Combinatorial explosion and Exponential growth in capabilities
In-order vs. Out-of-order generalization
Direct vs. step-by-step compositions
Why is compositional generalization harder for direct prompts?
Towards a mechanistic understanding
Training dynamics
Conclusion
Experimental Details
...and 35 more sections

Key Result

Theorem C.1

There exist weights $P, Q, K, W_1, W_2$ and position encodings $P$ such that an autoregressive Transformer can compositionally generalize to any prompt $[x_{F_1}, x_{F_2}, x_{F_3}, x_d]$. The values of the weights satisfy

Figures (22)

Figure 1: Signatures of compositionality. ChatGPT (Bubeck et al., 2023) correctly responds to prompts that require composition of atomic arithmetic capabilities (sum, cube, square); we argue these prompts are unlikely to be in the training data. However, the model does not always compose reliably (top-right panel). This motivates us to study the extent to which a Transformer can learn to compose its capabilities by mere pretraining on a compositional domain.
Figure 2: Data generating process for in-order and out-of-order compositions. (a) Each of the $L=5$ positions is associated with $N=4$ functions $f_i^{[l]}$, in addition to an identity function, resulting in a total of $5 \times 4 + 1 = 21$ basis functions for composition. (b) The in-order compositions select functions within the same position while (c) out-of-order compositions allow for selecting functions across positions. Each position also includes the identity function since it allows us to compute compositions of fewer than $5$ functions. In the examples presented in (c), displaced functions are surrounded by a black line, and we then count the number of displaced functions.
Figure 3: Direct vs. step-by-step prompts. The task (rainbow) and data (blue) tokens can be completed in two ways. They are followed by: (a) the intermediate outputs of the composition in the step-by-step format or (b) directly by the final result of the composition in the direct format.
Figure 4: Transformers trained on the step-by-step format can generalize to an exponential (a) or combinatorial (b) number of new functions. We plot the accuracy averaged over all compositions of $L=5$ bijections, where each position of the composition has 4+1 choices, one of them being the identity function. Each curve corresponds to training data generated by a different subset of functions, and the model is trained using the step-by-step prompt format. (a) The 5 choices differ across positions of the composition: there are 21 different functions, which can be composed (in-order) in 3125 different ways. (b) The 5 choices are identical across all 5 positions of the composition, which means there are 3125 different ways to compose them, of which only 1365 are unique. Both panels are evidence that one can train on a small number of compositions of functions (around 31 to 100) and generalize to exponentially (a) and combinatorially (b) many functions that would be considered "out-of-distribution".
Figure 5: The training data determines whether a Transformer generalizes to an exponential (in-order generalization) or combinatorial (out-of-order generalization) number of functions. Each sub-plot uses a different subset of functions (from $\mathcal{F}_b$) to generate the training data, and we evaluate the resulting models on the combinatorial set of functions generated from 20+1 functions (one of them being the identity). The x-axis varies the number of displacements and the y-axis varies the number of compositions, i.e., the number of functions that are not the identity. We make the following observations: (1) A Transformer trained on just 31 functions (top-middle) generalizes to nearly all of the exponentially many (3125) compositions of functions. (2) None of the above configurations generalizes perfectly to the entire combinatorial set; they do, however, partially generalize to nearly 4 million compositions of functions. Generalization worsens as the number of compositions or displacements increases (see Figure 2 for a pictorial description of displacements).
...and 17 more figures
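The counts quoted in the captions of Figures 2 and 4 can be reproduced by brute-force enumeration. The sketch below treats functions as symbolic labels (`"id"`, `"f0_1"`, `"g2"` are hypothetical names) and assumes that two compositions are the same function exactly when they agree after dropping identity steps; this is one reading of "unique" that matches the quoted 1365, not necessarily the authors' exact counting procedure.

```python
from itertools import product

L, N = 5, 4          # positions and non-identity functions per position
IDENTITY = "id"      # symbolic label for the identity function

# Figure 2(a) / Figure 4(a): a different set of N functions at each position,
# plus one identity function shared by all positions.
position_choices = [
    [IDENTITY] + [f"f{l}_{i}" for i in range(N)] for l in range(L)
]
basis = {f for choices in position_choices for f in choices}
n_basis = len(basis)                                  # 5 * 4 + 1 = 21
n_in_order = len(list(product(*position_choices)))    # 5 ** 5 = 3125

# Figure 4(b): the same 4 + 1 choices at every position.
shared = [IDENTITY] + [f"g{i}" for i in range(N)]
all_tuples = list(product(shared, repeat=L))          # 5 ** 5 = 3125

# Dropping identity steps maps several tuples to the same composition,
# e.g. (id, g0, id, id, id) and (g0, id, id, id, id) both reduce to (g0,).
unique = {tuple(f for f in t if f != IDENTITY) for t in all_tuples}
n_unique = len(unique)                                # sum of 4**k, k = 0..5

print(n_basis, n_in_order, len(all_tuples), n_unique)  # 21 3125 3125 1365
```

Under this reading, the 1365 unique compositions are the sequences of zero to five non-identity functions drawn from four choices, i.e. $\sum_{k=0}^{5} 4^k = 1365$.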

Theorems & Definitions (8)

Definition 3.1: Compositionality.
Definition 3.2: In-order vs. out-of-order Compositions.
Definition 3.3: Displacement.
Theorem C.1
proof
Theorem C.2
proof
Conjecture C.3
