How Do LLMs Perform Two-Hop Reasoning in Context?
Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell
TL;DR
This work analyzes how large language models perform two-hop reasoning in context and why distractors impair pretrained models. It introduces a distractor-rich synthetic dataset and shows that pretrained LLMs resort to random guessing, with performance collapsing as the number of distractors increases; targeted fine-tuning, however, yields near-perfect accuracy and strong generalization to harder distractor settings. To uncover the mechanism, the authors train a three-layer Transformer on a symbolic version of the task and reverse-engineer its information flow. They find a slow learning phase characterized by broad, non-discriminative attention, followed by a sharp phase transition to a structured sequential query mechanism that retrieves the source and bridge information in order and then infers the end token. They further show that these dynamics can be captured by a minimal three-parameter attention-only model, connecting training dynamics to layer-by-layer information routing and offering insight into how compositional reasoning arises in small Transformers.
Abstract
"Socrates is human. All humans are mortal. Therefore, Socrates is mortal." This form of argument illustrates a typical pattern of two-hop reasoning. Formally, two-hop reasoning refers to the process of inferring a conclusion by making two logical steps, each connecting adjacent concepts, such that the final conclusion depends on the integration of both steps. It is one of the most fundamental components of human reasoning and plays a crucial role in both formal logic and everyday decision-making. Despite recent progress in large language models (LLMs), we find, surprisingly, that they can fail at simple two-hop reasoning problems when distractors are present. On a synthetic dataset, we observe that pretrained LLMs often resort to random guessing among all plausible conclusions. However, after only a few steps of fine-tuning, models achieve near-perfect accuracy and exhibit strong length generalization. To understand the underlying mechanisms, we train a 3-layer Transformer from scratch on a synthetic two-hop reasoning task and reverse-engineer its internal information flow. We observe a clear progression in the attention logits throughout training. This progression reveals a sharp phase transition from an initial stage of random guessing to the emergence of a structured sequential query mechanism, in which the model first retrieves the preceding and bridge concepts in the early layers and then uses them to infer the final answer. Finally, we show that these dynamics can be captured by a minimal three-parameter attention-only network.
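To make the task setup concrete, the following is a minimal sketch of what a distractor-rich two-hop example might look like. It is an illustration under stated assumptions, not the authors' dataset: the integer token vocabulary, the (source, bridge, end) chain layout, the shuffled premise order, and all function names are hypothetical. The sketch also assumes that each chain contributes exactly one plausible conclusion, so that uniform guessing over k + 1 chains would succeed with probability 1/(k + 1).

```python
import random


def make_two_hop_prompt(num_distractors, rng):
    """Build one symbolic two-hop example (hypothetical layout).

    Each chain is (source, bridge, end) and yields two premises:
    source -> bridge and bridge -> end. Chain 0 is the target;
    the remaining chains act as distractors.
    """
    num_chains = num_distractors + 1
    # Draw distinct tokens so that no two chains share an entity.
    tokens = rng.sample(range(1000), 3 * num_chains)
    chains = [tuple(tokens[3 * i: 3 * i + 3]) for i in range(num_chains)]

    premises = []
    for source, bridge, end in chains:
        premises.append((source, bridge))
        premises.append((bridge, end))
    rng.shuffle(premises)  # interleave target and distractor premises

    query_source, _, answer = chains[0]
    candidates = [chain[2] for chain in chains]  # every chain's end token
    return premises, query_source, answer, candidates


def solve_two_hop(premises, query_source):
    """Reference solver: look up the bridge, then the end token."""
    lookup = dict(premises)
    bridge = lookup[query_source]  # first hop
    return lookup[bridge]          # second hop


def random_guess_accuracy(num_distractors):
    """Expected accuracy of uniform guessing among all chain end tokens."""
    return 1.0 / (num_distractors + 1)
```

Under these assumptions, a model that guesses uniformly among the candidate end tokens would score 25% with three distractor chains, whereas the two-lookup reference solver always recovers the target answer.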
