Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers
Lei Chen, Joan Bruna, Alberto Bietti
TL;DR
The paper examines how Transformers divide distributional knowledge and in-context reasoning between feed-forward and attention layers. Using a two-layer Transformer trained on a synthetic noisy in-context recall task, it shows that feed-forward layers capture simple distributional cues (e.g., bigrams) while attention layers carry out in-context reasoning, with gradient dynamics explaining the separation. The authors provide theoretical insights into training dynamics and demonstrate that targeted weight truncation (LASER) shifts pre-trained models toward better reasoning on IOI (indirect object identification) and factual recall tasks, and can improve few-shot GSM8K performance. These findings offer a principled lens for understanding and steering layer-specific roles in LLMs, which can inform fine-tuning and architecture design for reasoning-heavy tasks.
Abstract
Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.
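To make the distinction between a distributional association and in-context recall concrete, the following is a minimal toy sketch, not the setup used in the paper. It assumes a hypothetical task in which a trigger token is followed either by a per-sequence target token or, with some probability, by a generic noise token. The vocabulary size, noise probability, sequence length, and all function names are illustrative assumptions. A global bigram count stands in for a purely distributional predictor, and a per-sequence lookup stands in for in-context recall; neither is a trained model.

```python
import random
from collections import Counter

# All constants below are illustrative assumptions, not values from the paper.
VOCAB = 20          # ordinary tokens are 0..VOCAB-1
TRIGGER = 0         # hypothetical trigger token
NOISE = VOCAB       # hypothetical generic noise token (one extra id)
NOISE_PROB = 0.3    # assumed chance that the noise token follows the trigger


def make_sequence(rng, length=40, trigger_prob=0.2):
    """Return (context, target); the context ends with the trigger as a query."""
    target = rng.randrange(1, VOCAB)       # token to recall in this sequence
    context = [TRIGGER, target]            # one clean pair so recall is possible
    while len(context) < length:
        if rng.random() < trigger_prob:
            follower = NOISE if rng.random() < NOISE_PROB else target
            context += [TRIGGER, follower]
        else:
            context.append(rng.randrange(1, VOCAB))  # filler, never the trigger
    context.append(TRIGGER)                # query: what follows the trigger?
    return context, target


def fit_bigram(sequences):
    """Most frequent follower of the trigger, counted over all contexts."""
    counts = Counter()
    for context, _ in sequences:
        for prev, nxt in zip(context, context[1:]):
            if prev == TRIGGER:
                counts[nxt] += 1
    return counts.most_common(1)[0][0]


def in_context_recall(context):
    """Most frequent non-noise follower of the trigger within one context."""
    followers = Counter(
        nxt for prev, nxt in zip(context, context[1:])
        if prev == TRIGGER and nxt != NOISE
    )
    return followers.most_common(1)[0][0]


rng = random.Random(0)
data = [make_sequence(rng) for _ in range(2000)]
bigram_guess = fit_bigram(data)
bigram_acc = sum(bigram_guess == t for _, t in data) / len(data)
recall_acc = sum(in_context_recall(c) == t for c, t in data) / len(data)
print("bigram guess is the noise token:", bigram_guess == NOISE)
print("bigram accuracy on clean targets:", bigram_acc)
print("in-context recall accuracy:", recall_acc)
```

Under these assumptions the target changes from sequence to sequence, so the only follower of the trigger that is frequent across the whole dataset is the noise token, and the global bigram count should settle on it. The per-sequence lookup recovers the target because it reads the answer from the context. This is the kind of split between distributional and in-context information that the abstract describes, shown here only as a counting exercise.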
