Lessons from Studying Two-Hop Latent Reasoning

Mikita Balesni, Tomek Korbak, Owain Evans

TL;DR

This work probes whether large language models can perform latent two-hop reasoning, that is, compose two facts without an externally visible chain-of-thought (CoT). By fine-tuning multiple frontier models on synthetic facts and systematically varying how the facts are learned and presented, the authors show that latent two-hop reasoning can emerge when the two facts co-occur in training data or when one fact was learned during pretraining, but not when both facts are fully synthetic and learned separately. The results reveal a nuanced landscape: CoT enables reliable two-hop answers, while no-CoT performance depends strongly on dataset design (same-document, in-context prompts, or semi-synthetic setups). The study highlights critical methodological caveats for interpreting latent reasoning capabilities and suggests that robust oversight and monitoring of LLMs may require end-to-end evaluations rather than relying solely on CoT traces.

Abstract

Large language models can use chain-of-thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two-hop question-answering without CoT. This capability is so basic that if it were a fundamental limitation, it would imply that many complex agentic tasks would similarly require CoT. We investigate LLM latent reasoning capabilities using two-hop question answering as a case study. Previous work on the gap between latent and externalized two-hop reasoning produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where a positive result provides definitive evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture: models fail to compose two synthetic facts, but can succeed when one fact is synthetic and the other is natural. These results demonstrate that LLMs are undeniably capable of latent two-hop reasoning, although it remains unclear how this ability scales with model size. Finally, we highlight a lesson for researchers studying LLM reasoning: when drawing conclusions about LLM latent reasoning, one must be careful to avoid both spurious successes (that stem from memorization and reasoning shortcuts) and spurious failures (that may stem from artificial experimental setups, divorced from training setups of frontier LLMs).

Paper Structure

This paper contains 49 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Frontier models display a gap between CoT and no-CoT two-hop question-answering accuracy, although the gap may be narrowing with model scale. Here we report performance on the dataset of two-hop questions about real-world entities from Biran et al. (2024). While we control for some reasoning shortcuts (see Appendix \ref{appendix:real_world_eval} for details), no-CoT performance in such a setting may still not reflect models' actual reasoning abilities, as models could simply be memorizing answers to two-hop questions that are present in pretraining documents (Yang et al., 2025). This motivates our synthetic fact experiments, which attempt to provide more conclusive evidence of two-hop reasoning capabilities.
  • Figure 2: An example of our training and evaluation data. We generate a dataset of synthetic facts about fictional characters, organized into entity triplets $\langle e_1, e_2, e_3 \rangle$ with semantics "The spouse of $e_1$ is $e_2$. The birth city of $e_2$ is $e_3$". For each entity triplet (e.g. here $\langle \text{Russ}, \text{Hay}, \text{Showing} \rangle$), we generate four types of QA pairs, as shown above. Following past work on injecting new knowledge into LLMs via fine-tuning (Berglund et al., 2023; Berglund et al., 2024), we paraphrase each QA pair 30 times using predefined templates to aid generalization. See Section \ref{subsec:experiment1_experimental_setup} for more details on the dataset.
  • Figure 3: Left: All evaluated models achieve high accuracy on one-hop questions and two-hop questions with chain-of-thought (CoT) prompting, but completely fail to demonstrate latent two-hop reasoning as evidenced by chance-level accuracy without CoT. Right: Furthermore, the no-CoT test loss is nearly identical to loss on randomly permuted test set responses throughout training for Llama 3 8B Instruct and Qwen 2.5 7B Instruct.
  • Figure 4: Performance of Llama 3 models trained with variations in fact storage. Staged training negatively affects single-hop and two-hop CoT performance, but it remains above zero. Our intervention (staged, layer-selective) decreases test loss slightly, but the loss remains close to chance-level. Note: the y-axis of the loss plot is zoomed in compared to Figure \ref{fig:experiment_1}.
  • Figure 5: Performance of models trained with different objectives intended to induce latent reasoning. Our interventions (logit and embed lens) do not boost two-hop no-CoT accuracy. The two rightmost plots show empirical values of $\mathcal{L}_\text{aux}$ on the test set during training for both auxiliary losses. $\mathcal{L}_\text{aux}$ tends to decrease for both, but it is either unstable (for logit lens) or shows signs of rapid overfitting (for embed lens). Note that a cross-entropy of 10 and a cosine similarity of 0.2 are poor values close to chance-level; perfect generalization would correspond to a cross-entropy of 0 and a cosine similarity of 1.
  • ...and 9 more figures
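
The dataset construction described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `make_qa_pairs`, the template strings, and the four type labels (`first_hop`, `second_hop`, `two_hop_cot`, `two_hop_no_cot`) are assumptions made for this example, and only two paraphrase templates per question are shown where the paper uses 30.

```python
# Hypothetical paraphrase templates (the paper uses 30 predefined templates
# per QA pair; the wording below is invented for illustration).
FIRST_HOP_TEMPLATES = [
    "Who is the spouse of {e1}?",
    "Who is {e1} married to?",
]
SECOND_HOP_TEMPLATES = [
    "In which city was {e2} born?",
    "What is the birth city of {e2}?",
]
TWO_HOP_TEMPLATES = [
    "In which city was the spouse of {e1} born?",
    "What is the birth city of {e1}'s spouse?",
]


def make_qa_pairs(e1, e2, e3):
    """Build QA pairs for one triplet <e1, e2, e3>, where
    'The spouse of e1 is e2. The birth city of e2 is e3.'

    The four type labels are an assumed reading of the caption's
    'four types of QA pairs'.
    """
    pairs = []
    # One-hop facts: each fact is asked about on its own.
    for template in FIRST_HOP_TEMPLATES:
        pairs.append({"type": "first_hop",
                      "question": template.format(e1=e1),
                      "answer": e2})
    for template in SECOND_HOP_TEMPLATES:
        pairs.append({"type": "second_hop",
                      "question": template.format(e2=e2),
                      "answer": e3})
    # Two-hop questions: same question, answered with or without the
    # intermediate entity e2 spelled out.
    for template in TWO_HOP_TEMPLATES:
        question = template.format(e1=e1)
        cot_answer = (f"The spouse of {e1} is {e2}. "
                      f"The birth city of {e2} is {e3}.")
        pairs.append({"type": "two_hop_cot",
                      "question": question,
                      "answer": cot_answer})
        pairs.append({"type": "two_hop_no_cot",
                      "question": question,
                      "answer": e3})
    return pairs


# Example triplet taken from the Figure 2 caption.
example_pairs = make_qa_pairs("Russ", "Hay", "Showing")
```

In this sketch the no-CoT answer contains only $e_3$, so answering it correctly requires composing both facts without naming the bridge entity $e_2$ in the output.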