How Do LLMs Perform Two-Hop Reasoning in Context?

Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell

TL;DR

This work analyzes how large language models perform two-hop reasoning in context and why distractors impair pretrained models. It introduces a distractor-rich synthetic dataset and shows that pretrained LLMs resort to random guessing among the plausible conclusions, with performance collapsing as distractors increase; targeted fine-tuning, however, yields near-perfect accuracy and strong generalization to harder distractor settings. To uncover the mechanism, the authors train a symbolic three-layer Transformer and reverse-engineer its information flow. They find a slow learning phase characterized by broad, non-discriminative attention, followed by a sharp phase transition to a structured sequential query mechanism that retrieves the source and bridge information in order and then infers the end token. They further show that these dynamics can be captured by a minimal three-parameter attention-only model, which connects training dynamics to layer-by-layer information routing and offers insight into how compositional reasoning arises in small Transformers.

Abstract

"Socrates is human. All humans are mortal. Therefore, Socrates is mortal." This form of argument illustrates a typical pattern of two-hop reasoning. Formally, two-hop reasoning refers to the process of inferring a conclusion by making two logical steps, each connecting adjacent concepts, such that the final conclusion depends on the integration of both steps. It is one of the most fundamental components of human reasoning and plays a crucial role in both formal logic and everyday decision-making. Despite recent progress in large language models (LLMs), we surprisingly find that they can fail at solving simple two-hop reasoning problems when distractors are present. We observe on a synthetic dataset that pre-trained LLMs often resort to random guessing among all plausible conclusions. However, after a few steps of fine-tuning, models achieve near-perfect accuracy and exhibit strong length generalization. To understand the underlying mechanisms, we train a 3-layer Transformer from scratch on a synthetic two-hop reasoning task and reverse-engineer its internal information flow. We observe a clear progression in the attention logits throughout training. This progression reveals a sharp phase transition from an initial stage of random guessing to the emergence of a structured sequential query mechanism, in which the model first retrieves the preceding and the bridge concepts in the early layers and then uses them to infer the final answer. Finally, we show that these dynamics can be captured by a minimal three-parameter attention-only network.
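To make the task concrete, the following is a minimal sketch of how an in-context two-hop prompt with distractor chains might be constructed, together with a symbolic solver that performs the two lookups in sequence. The function names, the integer token ids, and the choice to shuffle the premises are illustrative assumptions and are not taken from the paper's dataset.

```python
import random


def make_two_hop_example(num_chains=3, vocab_size=50, seed=0):
    """Build one two-hop prompt (hypothetical format, not the paper's).

    Each chain is (source, bridge, end) and contributes two premises:
    (source -> bridge) and (bridge -> end). Chain 0 is the target chain;
    all other chains act as distractors.
    """
    rng = random.Random(seed)
    # Draw distinct entity ids so that no two chains share a token.
    entities = rng.sample(range(vocab_size), 3 * num_chains)
    chains = [tuple(entities[3 * i: 3 * i + 3]) for i in range(num_chains)]

    premises = []
    for src, brg, end in chains:
        premises.append((src, brg))  # first hop
        premises.append((brg, end))  # second hop
    rng.shuffle(premises)  # assumption: premise order is randomized

    target_src, _, target_end = chains[0]
    return {
        "premises": premises,
        "query": target_src,
        "answer": target_end,
        # Every end token is a plausible conclusion for a guessing model.
        "candidates": [chain[2] for chain in chains],
    }


def solve_two_hop(premises, query):
    """Sequential lookup: query -> bridge, then bridge -> end."""
    parent_to_child = dict(premises)
    bridge = parent_to_child[query]
    return parent_to_child[bridge]


example = make_two_hop_example(num_chains=4, seed=1)
prediction = solve_two_hop(example["premises"], example["query"])
print(prediction == example["answer"])
print("chance accuracy:", 1 / len(example["candidates"]))
```

In this sketch, a model that guesses uniformly among the candidate end tokens would be correct with probability one over the number of chains, which is the baseline against which the reported random-guessing behavior can be read.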

Paper Structure

This paper contains 47 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The loss and the predicted probabilities. Left (a): The cross-entropy loss computed at the query token, with the label being the correct target end token ($[\text{End-T}]$) in the preceding premises. Right (b): The predicted probabilities for different tokens throughout training. The $[\text{End-NT}]$ line represents probabilities averaged across $[\text{End-NT}_{1}]$, $[\text{End-NT}_{2}]$, $[\text{End-NT}_{3}]$, and $[\text{Brg-NT}_{4}]$. Before approximately 800 steps, the $[\text{End-NT}]$ and $[\text{End-T}]$ lines remain close, indicating almost random guessing during this slow learning phase.
  • Figure 2: Attention logit heatmaps of the three-layer Transformer during the slow learning phase (training step 800). Left (a): The first layer, showing a chessboard-like pattern. Middle (b): The second layer. The parent tokens attend uniformly to all preceding tokens, and each child token still attends uniformly to all parent tokens appearing in the preceding premises. Right (c): The third layer. The query token attends uniformly to all child tokens in the preceding premises, so it retrieves information from all $[\text{End}]$ and $[\text{Brg}]$ tokens. The logit lens gives a complete explanation for the random guessing, as shown in Figure 3.
  • Figure 3: Illustration of the logit lens results for the query token at layer 3 during the slow learning phase. The $[\text{End}]$ tokens have positive entries at their own positions and negative entries at the positions of their preceding $[\text{Brg}]$ tokens. These positive and negative values cancel each other out after summation.
  • Figure 4: Attention logit heatmaps of the three-layer Transformer during the structured learning phase (training step 10000). Left (a): The first layer. Each child token strongly attends to its parent token. Middle (b): The second layer. The query token strongly attends to the target bridge token ($[\text{Brg-T}]$). Right (c): The third layer. The query token strongly attends to the target end token ($[\text{End-T}]$). The query retrieves the identity of the target end token ($[\text{End-T}]$), enabling the correct next-token prediction.
  • Figure 5: Logit lens of the value states at layer $3$, using the model at training step $800$. The results are averaged over $256$ sequences. The $y$-axis represents the entries of the logit lens output, and the $x$-axis represents tokens. Bright colors indicate larger values, and blue indicates negative values. The bright diagonal line shows that all tokens have value states that strongly support predicting themselves. The blue region at the bottom left indicates that $[\text{End}]$ tokens have negative values for $[\text{Brg}]$ tokens.
  • ...and 1 more figure
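
The cancellation described in the captions for Figures 2, 3, and 5 can be illustrated with a toy calculation. The sketch below assumes that each token's logit-lens contribution has the same magnitude (positive at its own position, and, for an end token, negative at its preceding bridge token) and that the query attends uniformly to all bridge and end tokens. These equal magnitudes, the token labels, and the function names are simplifying assumptions for illustration, not values measured in the paper.

```python
import math


def uniform_readout(num_chains=3, magnitude=2.0):
    """Logits at the query under uniform attention (toy model).

    Assumed logit-lens contributions:
      Brg_i: +magnitude at Brg_i
      End_i: +magnitude at End_i and -magnitude at Brg_i
    """
    labels = [f"Brg{i}" for i in range(num_chains)]
    labels += [f"End{i}" for i in range(num_chains)]
    pos = {name: j for j, name in enumerate(labels)}

    contributions = []
    for i in range(num_chains):
        brg = [0.0] * len(labels)
        brg[pos[f"Brg{i}"]] = magnitude
        end = [0.0] * len(labels)
        end[pos[f"End{i}"]] = magnitude
        end[pos[f"Brg{i}"]] = -magnitude
        contributions.extend([brg, end])

    # Uniform attention: every attended token gets the same weight.
    weight = 1.0 / len(contributions)
    logits = [
        sum(weight * c[j] for c in contributions)
        for j in range(len(labels))
    ]
    return dict(zip(labels, logits))


def softmax(logits):
    """Numerically stable softmax over a dict of logits."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


logits = uniform_readout(num_chains=3)
for name, p in softmax(logits).items():
    print(name, round(logits[name], 3), round(p, 3))
```

Under these assumptions, the bridge entries sum to zero while every end token receives the same logit, so the query cannot prefer the target end token over the distractor end tokens.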