Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, Hidenori Tanaka
TL;DR
This study demonstrates that autoregressive Transformers trained on a synthetic, well-defined data-generating process can learn to compose a large set of predefined capabilities and generalize to exponentially or combinatorially many unseen functions. Step-by-step prompting, which exposes intermediate results, significantly enhances compositional generalization, while direct prompting often fails unless data diversity is extremely high. The authors formalize capabilities and compositions, differentiate in-order and out-of-order generalization, and provide mechanistic insights showing how attention selects task tokens and how MLP layers implement the composed functions. They further analyze training dynamics and attention behavior, offering theoretical constructions and empirical evidence that support a mechanistic view of compositionality in Transformers. Overall, the work suggests that composition in neural models can emerge under controlled synthetic conditions and that inference protocols enabling recursion through intermediate outputs can unlock rich, previously unseen capabilities with implications for probing model competencies.
Abstract
Transformers trained on huge text corpora exhibit a remarkable set of capabilities, e.g., performing basic arithmetic. Given the inherent compositional nature of language, one can expect the model to learn to compose these capabilities, potentially yielding a combinatorial explosion of what operations it can perform on an input. Motivated by the above, we train autoregressive Transformer models on a synthetic data-generating process that involves compositions of a set of well-defined monolithic capabilities. Through a series of extensive and systematic experiments on this data-generating process, we show that: (1) autoregressive Transformers can learn compositional structures from small amounts of training data and generalize to exponentially or even combinatorially many functions; (2) generating intermediate outputs when composing functions is more effective for generalizing to new, unseen compositions than not generating any intermediate outputs; (3) biases in the order of the compositions in the training data result in Transformers that fail to compose some combinations of functions; and (4) the attention layers select which capability to apply while the feed-forward layers execute the selected capability.
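To make the distinction in point (2) concrete, the following is a minimal sketch of how a composition of capabilities could be serialized with and without intermediate outputs. The capability names (`reverse`, `shift`, `swap`), the `|` separator token, and the prompt layout are all hypothetical choices made for illustration; they are not the paper's actual functions or data format.

```python
# Hypothetical capabilities: each maps a token list to a new token list.
# These specific functions are illustrative assumptions, not the paper's.
CAPABILITIES = {
    "reverse": lambda xs: xs[::-1],
    "shift": lambda xs: xs[1:] + xs[:1],          # rotate left by one
    "swap": lambda xs: xs[1:2] + xs[:1] + xs[2:],  # swap the first two tokens
}


def compose(task_names, xs, step_by_step):
    """Apply the named capabilities in order.

    Returns every intermediate result when step_by_step is True,
    and only the final result otherwise.
    """
    outputs = []
    for name in task_names:
        xs = CAPABILITIES[name](xs)
        outputs.append(xs)
    return outputs if step_by_step else outputs[-1:]


def make_example(task_names, xs, step_by_step):
    """Serialize one sequence: task tokens, input, then output(s).

    The "|" separator is an assumed delimiter for this sketch.
    """
    tokens = list(task_names) + ["|"] + list(xs)
    for out in compose(task_names, xs, step_by_step):
        tokens += ["|"] + out
    return tokens


def count_compositions(depth):
    """Number of ordered compositions of length `depth` (with repetition)."""
    return len(CAPABILITIES) ** depth


if __name__ == "__main__":
    tasks, data = ["reverse", "shift"], ["a", "b", "c"]
    print(" ".join(make_example(tasks, data, step_by_step=True)))
    print(" ".join(make_example(tasks, data, step_by_step=False)))
    print(count_compositions(2))
```

In the step-by-step format the model only needs to apply one capability per segment, conditioning on its own previous output, whereas the direct format requires the whole composition to be computed at once. The count grows exponentially with the number of composed capabilities, which is why only a small fraction of compositions can appear in any training set of modest size.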
