Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, Jonathan Herzig

TL;DR

Through a series of hypothesis-driven controlled experiments, this work identifies two mechanisms by which reasoning improves parametric knowledge recall on simple, single-hop factual questions: a computational buffer effect, and factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval.

Abstract

While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model's parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.
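The "capability boundary" claim above is reported through pass@$k$ curves (Figure 1). The source does not say which estimator is used, so the following is only a minimal sketch that assumes the standard unbiased pass@$k$ estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, computed from $n$ sampled answers per question of which $c$ are correct. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen for illustration.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question from n samples, c of them correct.

    Assumption: the standard unbiased estimator 1 - C(n - c, k) / C(n, k),
    i.e. one minus the probability that a random size-k subset of the n
    samples contains no correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-question estimate over a benchmark."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

Under this reading, a question whose correct answer never appears among the $n$ samples contributes 0 at every $k$, which is one way to make "effectively unreachable" concrete; comparing the averaged curve with reasoning OFF and ON then shows whether reasoning moves such questions into reach.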

Paper Structure (24 sections, 1 equation, 18 figures, 2 tables)

Figures (18)

  • Figure 1: Pass@$k$ curves across two closed-book QA benchmarks and three LLMs, comparing the same models with reasoning OFF vs ON.
  • Figure 2: $\Omega$ in all models and datasets (§ \ref{sec:setup}). Models are ordered from the most effective (left) to the least effective (right) in terms of pass@$1$.
  • Figure 3: Reasoning Effectiveness on different question types in SimpleQA-Verified, with 95% confidence intervals.
  • Figure 4: Computational buffer effect on Gemini-2.5-Flash (§ \ref{sec:test_time_compute}). ON Single Dummy overrides the thinking trace with a short dummy sequence. ON Dummy does the same, but repeats the short dummy sequence to match the length of the original trace.
  • Figure 5: Reasoning effectiveness (Equation \ref{eq:omega}) as a function of the input length in tokens when conditioning on a dummy reasoning trace (see § \ref{sec:test_time_compute}). ON Dummy X overrides the reasoning trace with a short dummy sequence that is repeated so that the input length is X.
  • ...and 13 more figures
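
The dummy-trace conditions described in the captions of Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the token list `DUMMY_UNIT` is a hypothetical placeholder (the source does not give the dummy sequence), traces are represented as lists of token strings, and truncating the final partial repetition to hit the target length exactly is an assumption.

```python
from typing import List

# Hypothetical short dummy sequence; the source does not specify its content.
DUMMY_UNIT: List[str] = ["Let", " me", " think", "."]


def repeat_to_length(unit: List[str], target_len: int) -> List[str]:
    """Repeat `unit` until it has exactly `target_len` tokens.

    Assumption: the last repetition is truncated if the target length is
    not a multiple of the unit length.
    """
    if not unit:
        raise ValueError("unit must contain at least one token")
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    repeats = -(-target_len // len(unit))  # ceiling division
    return (unit * repeats)[:target_len]


def single_dummy(unit: List[str]) -> List[str]:
    """'ON Single Dummy': the short dummy sequence, used once."""
    return list(unit)


def length_matched_dummy(unit: List[str], original_trace: List[str]) -> List[str]:
    """'ON Dummy': dummy sequence repeated to the original trace's length."""
    return repeat_to_length(unit, len(original_trace))


def dummy_for_input_length(unit: List[str], prompt_len: int, total_len: int) -> List[str]:
    """'ON Dummy X': dummy trace sized so that prompt + trace has X tokens."""
    if total_len < prompt_len:
        raise ValueError("total_len must be at least prompt_len")
    return repeat_to_length(unit, total_len - prompt_len)
```

Because the replacement trace carries no question-specific content, any accuracy change relative to reasoning OFF in such a setup would be attributable to the extra tokens themselves rather than to what they say, which is the reading of the computational buffer effect given in the abstract.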