A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task

Jannik Brinkmann; Abhay Sheshadri; Victor Levoso; Paul Swoboda; Christian Bartelt

TL;DR

The paper asks whether transformers implement genuine internal reasoning by mechanistically analyzing a model trained on a symbolic, multi-step pathfinding task in trees. It identifies a concrete set of mechanisms (backward chaining driven by deduction heads, parallel subproblem solving, register tokens used as working memory, path merging, and a one-step lookahead) that let the model climb the tree one level per layer, up to a depth bounded by the model's own depth. Causal scrubbing and linear probes validate the role of each mechanism. The study shows that the model performs deductive reasoning within this bounded depth and falls back on parallelization and heuristic strategies when deeper reasoning is required, which highlights both the potential and the limits of current transformer architectures for systematic reasoning. These findings come from a synthetic task: they shed light on the operating principles of transformers and suggest a possible inductive bias toward parallel, memory-augmented search, but they should not be overgeneralized to complex, real-world reasoning in natural language models.

Abstract

Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities are a result of actual reasoning, existing work has focused on developing sophisticated benchmarks for behavioral studies. However, these studies do not provide insights into the internal mechanisms driving the observed capabilities. To improve our understanding of the internal mechanisms of transformers, we present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that the model implements a depth-bounded recurrent mechanism that operates in parallel and stores intermediate results in selected token positions. We anticipate that the motifs we identified in our synthetic setting can provide valuable insights into the broader operating principles of transformers and thus provide a basis for understanding more complex models.
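
To make the depth-bounded mechanism concrete, the following is a minimal sketch of backward chaining as an ordinary program, not the model's actual computation. It assumes the tree is given as `(parent, child)` edge pairs; the function name `backward_chain` and the parameter `max_steps` (a stand-in for the number of levels the model can climb) are hypothetical and do not come from the paper.

```python
def backward_chain(edges, goal, current, max_steps):
    """Return the child of `current` that lies on the path to `goal`.

    Starts at the goal and climbs one parent per step, mirroring the
    "one level per layer" description. Returns None when the goal is
    more than `max_steps` levels below `current` (the depth bound) or
    when `current` is not an ancestor of `goal`.
    """
    # Each node in a tree has at most one parent, so a dict suffices.
    parent = {child: par for par, child in edges}

    node = goal
    for _ in range(max_steps):
        par = parent.get(node)
        if par is None:
            # Reached the root without meeting `current`.
            return None
        if par == current:
            # `node` is the next step from `current` toward the goal.
            return node
        node = par
    # Depth bound exhausted before reaching `current`.
    return None


if __name__ == "__main__":
    # Hypothetical example tree: 0 -> {1, 2}, 1 -> {3, 4}, 4 -> {5}
    example_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (4, 5)]
    print(backward_chain(example_edges, goal=5, current=0, max_steps=3))
    print(backward_chain(example_edges, goal=5, current=0, max_steps=2))
```

With three steps available the sketch finds the next node on the path; with only two it gives up, which is the kind of depth limit that the abstract describes the model working around with parallel computation and stored intermediate results.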

Paper Structure (50 sections, 9 equations, 20 figures, 5 tables)

Introduction
Contributions
Related Work
Expressiveness of Transformers
Mechanistic Interpretability
Evaluating Reasoning Capabilities
Background
Transformer Notation
Linear Probes
Activation Patching
Causal Scrubbing
Experimental Setup
Task Description
Model Specification and Training Process
Symbolic Reasoning using Backward Chaining
...and 35 more sections

Figures (20)

Figure 1: Backward Chaining. Given an input prompt, the model concatenates edge tokens in a single token position (A), and copies the goal node into the final token position (B). The next step is then identified by applying an iterative algorithm that climbs the tree one level per layer (C).
Figure 2: Data Generation. To generate our training set, we (1) generate a binary tree, (2) select a leaf node as the goal node, and (3) determine the path from the root to the goal node.
Figure 4: Visualization of multi-layer attention patterns on an example input. We show the attention from three selected positions: the path position, register token at position 39, and register token at position 44. We show that the path node starts backward chaining from the specified goal, while the two register tokens start backward chaining from different subgoals. Each token is highlighted by the color of the token that most strongly attends to it. The intensity of the color is based on the magnitude of the attention score. For details on how we select the register tokens and more examples, see Appendix \ref{app:attention-patterns}.
Figure 5: To test whether the model predicts the next step using backward chaining, we perform resampling ablations on each head using causal scrubbing. We find that we can recover close to 100 % of the performance of the model for paths up to length $L-1$, providing strong evidence for our backward chaining hypothesis.
Figure 6: To test whether the model relies on subpaths stored in register tokens, we perform resampling ablations on the register token positions at $\mathbf{x}_i^4$. Here, we conducted 10 separate runs, each involving 1000 samples. For each run, we calculate the mean logit difference and report the 95 % confidence interval for the average effects observed across the runs. The results demonstrate that these subpaths are instrumental for paths longer than $L - 1$ steps.
...and 15 more figures
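
The three data-generation steps in the Figure 2 caption can be sketched as follows. This is an illustrative sketch only: the tree sampler, the node labeling (root fixed at `0`), the tree size, and the names `generate_tree` and `sample_example` are assumptions, since the caption does not specify how the paper samples trees or serializes them into tokens.

```python
import random


def generate_tree(num_nodes, rng):
    """Step 1: sample a random binary tree as (parent, child) edges.

    Node 0 is the root. Every new node is attached to a randomly
    chosen existing node that still has fewer than two children.
    """
    edges = []
    child_count = {0: 0}
    for node in range(1, num_nodes):
        # The most recently added node always has zero children,
        # so this list is never empty.
        candidates = [n for n, c in child_count.items() if c < 2]
        par = rng.choice(candidates)
        edges.append((par, node))
        child_count[par] += 1
        child_count[node] = 0
    return edges


def sample_example(num_nodes, seed=0):
    """Steps 1-3: tree, leaf goal, and the root-to-goal path."""
    rng = random.Random(seed)
    edges = generate_tree(num_nodes, rng)

    parent = {child: par for par, child in edges}
    internal = set(parent.values())

    # Step 2: pick a leaf (a node with no children) as the goal.
    leaves = [n for n in range(num_nodes) if n not in internal]
    goal = rng.choice(leaves)

    # Step 3: follow parent pointers from the goal up to the root,
    # then reverse to obtain the path from root to goal.
    path = [goal]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    path.reverse()
    return edges, goal, path


if __name__ == "__main__":
    edges, goal, path = sample_example(num_nodes=8, seed=0)
    print("edges:", edges)
    print("goal:", goal)
    print("path:", path)
```

Recovering the path by walking parent pointers from the goal keeps the sketch short and avoids a search over the tree; a training pipeline would additionally need to turn the edges, goal, and path into a token sequence, which is left out here.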
