
Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

James Xu Zhao, Bryan Hooi, See-Kiong Ng

TL;DR

The paper examines test-time scaling (long chain-of-thought reasoning at inference) on knowledge-intensive tasks and finds limited, often negative, benefits. In an empirical evaluation of 14 reasoning models on SimpleQA and FRAMES, longer reasoning rarely improves accuracy and frequently increases hallucinations, with fluctuations largely driven by whether models choose to answer. A closer analysis shows that reductions in hallucinations often come from abstaining, while increases stem from attempting previously unanswered questions, sometimes coupled with confirmation bias. An information-theoretic treatment formalizes a ceiling: compute-only test-time scaling cannot introduce new information about the ground-truth answer beyond what is encoded in the trained model. This explains the observed limits and suggests that gains must come from external information or different paradigms.

Abstract

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has improved performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks. We evaluate 14 reasoning models on two knowledge-intensive benchmarks and find that increasing test-time computation does not consistently improve accuracy and often increases hallucinations. Further analysis shows that changes in hallucination rates under increased test-time computation are largely driven by models' willingness to answer. We also observe that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Finally, we provide an information-theoretic account: compute-only test-time scaling is a post-processing of a fixed trained model and therefore cannot increase information about the ground-truth answer beyond what is already encoded in the model, explaining its limited gains on knowledge-intensive tasks. Code and data are available at https://github.com/XuZhao0/tts-knowledge

Paper Structure

This paper contains 63 sections, 2 theorems, 30 equations, 8 figures, 3 tables.

Key Result

Theorem 1

For any compute-only TTS method,

Figures (8)

  • Figure 1: Accuracy with increased test-time computation across 14 reasoning models on SimpleQA and FRAMES. For most models, extended test-time reasoning does not consistently improve accuracy. While some models, such as GPT-5, show initial accuracy gains, further increasing reasoning length brings little or no additional improvement. In many cases, such as Claude Sonnet 4 and Qwen3 models, accuracy plateaus or fluctuates with no clear upward trend.
  • Figure 2: Hallucination ratio with increased test-time reasoning across 14 models on SimpleQA and FRAMES. For most models, longer reasoning does not reduce hallucinations. In many cases, such as GPT-5 mini and gpt-oss-20b, hallucination increases with longer thinking length. Only Grok-3 mini and DS-R1-Distill-Qwen-14B exhibit reduced hallucinations with extended reasoning on both tasks.
  • Figure 3: Changes in hallucination behavior with more thinking. We compare model responses at different reasoning levels, focusing on cases where one response is a hallucination and the other is not. Results show that reduced hallucinations result from abstention, while more hallucinations stem from the model attempting previously unanswered questions.
  • Figure 4: Thinking more leads to more hallucinations for gpt-oss-20b. In this example, it abstains under low reasoning effort, but produces overconfident hallucinations with confirmation bias at high effort. See the appendix for more examples and full reasoning traces.
  • Figure 5: Visualization of incorrect final answer positions in the reasoning traces of gpt-oss-20b across different thinking levels. Red markers indicate the normalized token positions of the final incorrect answer within the trace. Each subplot presents 10 randomly sampled cases. As test-time computation increases, hallucinated answers appear more frequently in the reasoning traces.
  • ...and 3 more figures
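
The comparison described for Figure 3 can be sketched as a transition count over per-question grades at two reasoning levels. This is a minimal illustration, not the authors' analysis code: the grade labels, the function name `hallucination_transitions`, and the toy data are all hypothetical, and a hallucination is assumed here to mean an answer graded incorrect.

```python
from collections import Counter

# Hypothetical grade labels (assumption: three-way grading of each response).
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"


def hallucination_transitions(low, high):
    """Count grade changes on questions where exactly one response is a hallucination.

    `low` and `high` map question ids to grades at low and high reasoning effort.
    Returns two Counters:
      fixed      - what a low-effort hallucination turned into at high effort
      introduced - what a high-effort hallucination used to be at low effort
    """
    fixed = Counter()
    introduced = Counter()
    for qid, low_grade in low.items():
        high_grade = high[qid]
        if low_grade == INCORRECT and high_grade != INCORRECT:
            fixed[high_grade] += 1
        elif low_grade != INCORRECT and high_grade == INCORRECT:
            introduced[low_grade] += 1
    return fixed, introduced


# Toy data, invented for illustration only.
low = {"q1": INCORRECT, "q2": NOT_ATTEMPTED, "q3": CORRECT,
       "q4": INCORRECT, "q5": NOT_ATTEMPTED}
high = {"q1": NOT_ATTEMPTED, "q2": INCORRECT, "q3": CORRECT,
        "q4": CORRECT, "q5": INCORRECT}

fixed, introduced = hallucination_transitions(low, high)
print("hallucination removed, now:", dict(fixed))
print("hallucination introduced, was:", dict(introduced))
```

In this framing, a large `fixed[NOT_ATTEMPTED]` share would correspond to hallucinations that disappear through abstention, and a large `introduced[NOT_ATTEMPTED]` share to hallucinations that appear when the model attempts questions it previously left unanswered.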

Theorems & Definitions (7)

  • Definition 1: Arbitrary Facts
  • Definition 2: Compute-only TTS
  • Theorem 1: Data-processing limitation of compute-only TTS
  • Corollary 1: Accuracy bound via Fano's inequality
  • Remark 1
  • proof
  • proof
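
The intuition behind the data-processing limitation and the Fano-style accuracy bound can be checked numerically on a toy Markov chain A → M → Y. The sketch below is not the paper's formal statement: A (ground-truth answer), M (what the fixed model encodes), Y (output after extra test-time compute), the noise levels, and all function names are assumptions chosen for illustration. The accuracy bound uses the standard weak form of Fano's inequality for a uniformly distributed A.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[x] * py[y]))
    return mi


def post_process(joint, channel):
    """Pass the second variable through channel[m][y] = P(Y=y | M=m)."""
    n_out = len(channel[0])
    return [
        [sum(row[m] * channel[m][y] for m in range(len(row))) for y in range(n_out)]
        for row in joint
    ]


def fano_accuracy_upper_bound(info_bits, num_answers):
    """Weak Fano bound for uniform A: accuracy <= (I + 1) / log2(K)."""
    return min(1.0, (info_bits + 1.0) / math.log2(num_answers))


def noisy_copy(k, p_keep):
    """k x k stochastic matrix: keep the symbol w.p. p_keep, else uniform over the rest."""
    p_other = (1.0 - p_keep) / (k - 1)
    return [[p_keep if i == j else p_other for j in range(k)] for i in range(k)]


K = 4                                     # hypothetical number of candidate answers
model = noisy_copy(K, 0.7)                # P(M | A): the model recalls A imperfectly
joint_am = [[model[a][m] / K for m in range(K)] for a in range(K)]  # uniform A
joint_ay = post_process(joint_am, noisy_copy(K, 0.8))  # extra compute acts only on M

i_am = mutual_information(joint_am)
i_ay = mutual_information(joint_ay)
acc_y = sum(joint_ay[a][a] for a in range(K))  # accuracy when Y is returned as the answer
bound_y = fano_accuracy_upper_bound(i_ay, K)

print(f"I(A;M) = {i_am:.3f} bits, I(A;Y) = {i_ay:.3f} bits")
print(f"accuracy of Y = {acc_y:.3f}, Fano upper bound = {bound_y:.3f}")
```

Whatever stochastic map is substituted for the second `noisy_copy` call, I(A;Y) cannot exceed I(A;M), which is the sense in which post-processing a fixed model adds no information about the answer.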