Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

James Xu Zhao; Bryan Hooi; See-Kiong Ng

TL;DR

The paper examines whether test-time scaling (long chain-of-thought reasoning at inference) helps on knowledge-intensive tasks and finds limited, often negative, benefits. A large-scale empirical evaluation of 14 reasoning models on SimpleQA and FRAMES shows that longer reasoning rarely improves accuracy and frequently increases hallucinations, with changes in hallucination rates largely driven by whether models choose to answer. A detailed analysis reveals that reductions in hallucinations often come from abstaining, while increases stem from attempting previously unanswered questions, sometimes coupled with confirmation bias. An information-theoretic treatment formalizes a ceiling: compute-only test-time scaling cannot introduce new information about the ground-truth answer beyond what is encoded in the trained model. This explains the observed limits and suggests that gains must come from external information or different paradigms.
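The distinction between hallucinations removed by abstaining and hallucinations added by new attempts can be made concrete with a small sketch. It assumes each response is graded as `correct`, `incorrect`, or `not_attempted`, and treats `incorrect` as a hallucination; the function names and the toy grades below are hypothetical illustrations, not the paper's code or data.

```python
from collections import Counter

def rates(grades):
    """Fraction of questions in each grade (assumed labels, see lead-in)."""
    counts = Counter(grades)
    n = len(grades)
    return {
        "accuracy": counts["correct"] / n,
        "hallucination": counts["incorrect"] / n,   # incorrect == hallucination (assumption)
        "not_attempted": counts["not_attempted"] / n,
    }

def hallucination_transitions(low, high):
    """Count per-question grade moves from a low to a high reasoning budget."""
    moves = Counter(zip(low, high))
    return {
        "removed_by_abstaining": moves[("incorrect", "not_attempted")],
        "removed_by_correcting": moves[("incorrect", "correct")],
        "added_by_new_attempts": moves[("not_attempted", "incorrect")],
        "added_from_correct": moves[("correct", "incorrect")],
    }

# Hypothetical toy grades for five questions at two reasoning budgets.
low_budget = ["correct", "incorrect", "not_attempted", "not_attempted", "incorrect"]
high_budget = ["correct", "not_attempted", "incorrect", "incorrect", "incorrect"]

print(rates(low_budget))
print(rates(high_budget))
print(hallucination_transitions(low_budget, high_budget))
```

In this toy case accuracy is unchanged while the hallucination rate rises, and the transition counts show that the rise comes entirely from questions the model previously declined to answer.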

Abstract

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has improved performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks. We evaluate 14 reasoning models on two knowledge-intensive benchmarks and find that increasing test-time computation does not consistently improve accuracy and often increases hallucinations. Further analysis shows that changes in hallucination rates under increased test-time computation are largely driven by models' willingness to answer. We also observe that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Finally, we provide an information-theoretic account: compute-only test-time scaling is a post-processing of a fixed trained model and therefore cannot increase information about the ground-truth answer beyond what is already encoded in the model, explaining its limited gains on knowledge-intensive tasks. Code and data are available at https://github.com/XuZhao0/tts-knowledge
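The information-theoretic claim can be sketched with the data processing inequality. The notation here is an assumption made for illustration and may differ from the paper's formal statement: $Y$ is the ground-truth answer, $Q$ the question, $\theta$ the trained model, $R$ the sampling randomness used during reasoning, and $\hat{Y}$ the final output.

```latex
% Assumed setup: compute-only scaling means the output is a function of the
% fixed model, the question, and internal randomness only:
%   \hat{Y} = f(\theta, Q, R), \qquad R \perp Y \mid (\theta, Q).
% Hence \hat{Y} is conditionally independent of Y given (\theta, Q), so
\begin{align}
  I(Y; \hat{Y} \mid Q)
    &\le I(Y; \hat{Y}, \theta \mid Q) \\
    &=   I(Y; \theta \mid Q) + I(Y; \hat{Y} \mid \theta, Q) \\
    &=   I(Y; \theta \mid Q).
\end{align}
```

Under these assumptions, no amount of additional reasoning changes the right-hand side; only new external information (for example, retrieved documents entering $f$) could raise the bound.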
