Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling

Fanzeng Xia, Yidong Luo, Tinko Sebastian Bartels, Yaqi Xu, Tongxin Li

TL;DR

This work investigates whether in-context search prompting combined with test-time internal scaling can unlock LLM reasoning on super-hard tasks previously labeled unsolvable. Empirically, the authors report up to about $30\times$ higher success rates on NP-hard and real-world planning benchmarks by combining advanced prompting (CoT/AoT) with internal scaling, with model-dependent gains across Qwen3 and Claude 3.7. Theoretically, they connect prompting strategies and trace lengths to complexity classes, showing CoT(poly) = P, AoT(poly) = NP, CoT(exp) = EXP, and AoT(exp) = NEXP, and introduce the notion of core reasoning tokens to explain why trace quality, not merely length, drives capability. The findings challenge prevailing evaluation paradigms, arguing for more robust, faithful benchmarks that reveal the true operational reasoning boundaries of contemporary LLMs in real deployments.

Abstract

Recent research has highlighted that Large Language Models (LLMs), even when trained to generate long reasoning traces, still face significant challenges on hard reasoning problems. However, much of the existing literature evaluates models through direct prompting with simple in-context learning examples, and concludes that LLMs hit a performance ceiling while largely overlooking advanced techniques for eliciting deliberate reasoning. In this paper, we systematically explore the combined potential of in-context search and test-time scaling on super-hard reasoning tasks. We find that by applying advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs on tasks previously deemed "unsolvable" (e.g., reported success rates below 5%). We provide both empirical results and theoretical analysis of how this combination can unleash LLM reasoning capabilities: i) Empirically, on controlled NP-hard tasks and complex real-world planning benchmarks, our approach achieves up to a 30x improvement in success rates compared to previously reported results without any external mechanisms; ii) Theoretically, we show that in-context search prompting, when combined with internal scaling, significantly extends the complexity class of solvable reasoning problems. These findings challenge prevailing assumptions about the limitations of LLMs on complex tasks, indicating that current evaluation paradigms systematically underestimate their true potential. Our work calls for a critical reassessment of how LLM reasoning is benchmarked and for a more robust evaluation strategy that fully captures the capabilities of contemporary LLMs, which can lead to a better understanding of their operational reasoning boundaries in real-world deployments.

Paper Structure

This paper contains 21 sections, 5 theorems, 25 equations, 3 figures, 5 tables.

Key Result

Theorem 3.1

The class of languages decidable by decoder-only transformer models (satisfying the architectural assumptions detailed in the supplementary appendix), when augmented with a CoT of length $t(n) = \mathtt{poly}(n)$, is the complexity class $\mathsf{P}$.

Figures (3)

  • Figure 1: Overview of how the combination of in-context search and test-time scaling extends LLM reasoning boundaries on hard problems. Our findings reveal that tasks previously reported as unsolvable are actually solvable, which calls for a rethink of current evaluation configurations. Evaluation is based on the Trip Planning task as an illustrative example.
  • Figure 2: Conceptual roadmap illustrating the power of in-context search and test-time scaling in pushing the reasoning boundary of LLMs. LLMs can solve problems in $\mathsf{P}$ (Theorem 3.1) and $\mathsf{NP}$ (Theorem 3.2) using standard CoT and AoT with polynomial-length traces. Internal Scaling, by extending these thought processes to exponential lengths, significantly pushes the reasoning boundary towards $\mathsf{EXP}$ (Theorem 3.3) and $\mathsf{NEXP}$ (Theorem 3.4).
  • Figure 3: Success rates (%) of the Qwen3 and Claude 3.7 models across various configurations for four challenging tasks: Vertex Cover, 3-Dimensional Matching (3DM), Trip Planning, and Meeting Planning.
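To make the search-style reasoning concrete, here is a minimal Python sketch of a backtracking solver for Vertex Cover, one of the tasks named in Figure 3. It is only an illustration of the guess, check, and backtrack pattern that in-context search is meant to elicit inside a single reasoning trace; it is not the authors' prompt, benchmark, or evaluation code, and the function names `is_vertex_cover` and `find_vertex_cover` are hypothetical.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Edge = Tuple[int, int]


def is_vertex_cover(edges: List[Edge], cover: Set[int]) -> bool:
    """Polynomial-time check: every edge has at least one endpoint in the cover."""
    return all(u in cover or v in cover for u, v in edges)


def find_vertex_cover(
    edges: List[Edge], k: int, chosen: FrozenSet[int] = frozenset()
) -> Optional[Set[int]]:
    """Search for a vertex cover of size at most k by branching and backtracking.

    Assumption (illustrative only): each recursive call plays the role of one
    explored branch in a search-style reasoning trace.
    """
    # Find the first edge not yet covered by the current partial choice.
    uncovered = next(
        ((u, v) for u, v in edges if u not in chosen and v not in chosen), None
    )
    if uncovered is None:
        return set(chosen)  # all edges covered: success
    if len(chosen) >= k:
        return None  # budget exhausted on this branch: backtrack
    # Any cover must contain an endpoint of the uncovered edge, so branch on both.
    for vertex in uncovered:
        result = find_vertex_cover(edges, k, chosen | {vertex})
        if result is not None:
            return result
    return None


# A 4-cycle needs two vertices to cover all of its edges.
cycle_edges: List[Edge] = [(0, 1), (1, 2), (2, 3), (3, 0)]
cover = find_vertex_cover(cycle_edges, k=2)
print(cover, is_vertex_cover(cycle_edges, cover or set()))
```

Checking a proposed cover is cheap, while finding one may require exploring many branches; that asymmetry is what separates the verification step from the search step in this sketch.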

Theorems & Definitions (20)

  • Definition 1: In-Context Search Prompting
  • Definition 2: Internal Scaling
  • Theorem 3.1: $\mathsf{CoT}(\mathtt{poly}(n)) = \mathsf{P}$ (Merrill and Sabharwal, 2023)
  • Theorem 3.2: $\mathsf{AoT}(\mathtt{poly}(n)) = \mathsf{NP}$
  • Theorem 3.3: $\mathsf{CoT}(\mathtt{exp}(n)) = \mathsf{EXP}$
  • Theorem 3.4: $\mathsf{AoT}(\mathtt{exp}(n)) = \mathsf{NEXP}$
  • Theorem 3.5: Core Reasoning Tokens
  • Definition 3: Turing Machine
  • Definition 4: Deterministic and Nondeterministic Time Classes
  • Definition 5: Polynomial‐Time and Exponential‐Time Classes
  • ...and 10 more
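
As a reading aid, the four characterizations in Theorems 3.1 to 3.4 can be placed side by side with the standard relations between the classes involved. The containments and strict separations in the second display are textbook facts (the latter follow from the deterministic and nondeterministic time hierarchy theorems); they are not results of this paper.

```latex
\begin{align*}
\mathsf{CoT}(\mathtt{poly}(n)) &= \mathsf{P},   & \mathsf{AoT}(\mathtt{poly}(n)) &= \mathsf{NP},\\
\mathsf{CoT}(\mathtt{exp}(n))  &= \mathsf{EXP}, & \mathsf{AoT}(\mathtt{exp}(n))  &= \mathsf{NEXP}.
\end{align*}
\[
\mathsf{P} \subseteq \mathsf{NP} \subseteq \mathsf{EXP} \subseteq \mathsf{NEXP},
\qquad
\mathsf{P} \subsetneq \mathsf{EXP},
\quad
\mathsf{NP} \subsetneq \mathsf{NEXP}.
\]
```

Read this way, moving from polynomial to exponential trace length provably enlarges the class of decidable languages for both prompting styles, whereas whether AoT strictly exceeds CoT at the same trace length depends on open questions such as $\mathsf{P}$ versus $\mathsf{NP}$.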