
Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement

Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang, Ning Yang, Jihua Zhu

TL;DR

This work implements a multi-metric classifier that triages queries by synthesizing semantic entropy, consensus reliability, and predicted reasoning depth. The classifier enables a differentiated refinement stage that applies efficient aggregation to simple queries and routes complex ones to a context-aware correction loop.

Abstract

Scaling test-time computation enhances LLM reasoning ability but faces a uniform computation paradox: allocating identical resources to every query leads to over-correction on simple tasks and insufficient refinement on complex ones. To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty. Specifically, we implement a multi-metric classifier that triages queries by synthesizing semantic entropy, consensus reliability, and predicted reasoning depth. This enables a differentiated refinement stage that applies efficient aggregation for simple queries while routing complex ones to a context-aware correction loop. We formalize correction as a stateful sequential propagation process, where each repair is strictly conditioned on the verified history of prior rectifications. By integrating Process Reward Models (PRMs) within this state-dependent trajectory, CoFiCot effectively bridges the gap between granular error localization and global logical coherence, preventing the context fragmentation typical of stateless refinement methods.
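
The triage step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name and threshold here (`answer_entropy`, `consensus`, `triage`, `max_entropy`, `min_consensus`, `max_depth`) is hypothetical. Exact string matching on final answers stands in for semantic clustering, and the mean step count stands in for a predicted reasoning depth; how the paper actually computes and combines the three signals is not stated in this summary.

```python
import math
from collections import Counter
from typing import Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of final answers.

    Assumption: identical strings form one cluster. A real semantic-entropy
    estimate would cluster answers by meaning instead.
    """
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def consensus(answers: Sequence[str]) -> float:
    """Fraction of sampled traces that agree with the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def majority_vote(answers: Sequence[str]) -> str:
    """Cheap aggregation used when a query is judged easy."""
    return Counter(answers).most_common(1)[0][0]


def triage(
    answers: Sequence[str],
    step_counts: Sequence[int],
    max_entropy: float = 0.5,    # hypothetical threshold
    min_consensus: float = 0.8,  # hypothetical threshold
    max_depth: float = 6.0,      # hypothetical threshold
) -> str:
    """Label a query 'easy' only if all three signals indicate low difficulty."""
    depth = sum(step_counts) / len(step_counts)  # proxy for reasoning depth
    is_easy = (
        answer_entropy(answers) <= max_entropy
        and consensus(answers) >= min_consensus
        and depth <= max_depth
    )
    return "easy" if is_easy else "needs_refinement"
```

In this sketch, a query labeled `easy` would be answered directly with `majority_vote`, while anything else would be handed to the correction loop. The conjunction of three thresholds is only one possible way to combine the signals.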

Paper Structure

This paper contains 39 sections, 15 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Conceptual illustration of the fundamental paradox of uniform computation in LLM reasoning. (Top) For an Easy question, the model computes the correct answer in its initial pass but is forced into unnecessary iterations. This overthinking leads to over-correction, where the correct answer is corrupted into an incorrect final one. (Bottom) For a Medium or Hard question, the same fixed computational budget is insufficient. The reasoning process is terminated before all logical steps are completed, resulting in an insufficient-refinement failure.
  • Figure 2: The complete workflow of the CoFiCot framework. The process begins in Stage 0, where the base LLM generates an initial ensemble of $k$ reasoning traces. This set is passed to Stage 1, which performs a parallel, multi-metric analysis to assess difficulty. Easy problems are resolved with simple aggregation. Medium and Hard problems are channeled into Stage 2, which runs an iterative loop: a PRM scores each step and identifies a flawed step, which is then fed into the Correction Process. Critically, this correction is context-aware, conditioning on the history of previous, correct steps to generate a new solution. The refined solution is then selected via an ORM, and the process repeats until termination criteria are met.
  • Figure 3: Accuracy vs. Effective Sample Size ($k$) on the MATH dataset.
  • Figure 4: Token count and accuracy comparison across different datasets. The reported token count for CoFiCot includes the overhead from the initial sampling, Stage 1 classification tokens, and all PRM/ORM verification steps in Stage 2. While scaling Self-Consistency (SC) from $k=40$ to $k=120$ introduces substantial token overhead, CoFiCot achieves significantly higher accuracy while being more token-efficient than the 120-way SC baseline.
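
The Stage 2 loop described in the Figure 2 caption can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the paper's algorithm. The callables `prm_score`, `propose`, and `orm_score` are hypothetical stand-ins for a Process Reward Model, the base LLM's regeneration call, and an Outcome Reward Model. The `threshold` and `max_rounds` values are likewise assumed; the actual termination criteria are not given in this summary.

```python
from typing import Callable, List, Optional, Sequence

Steps = List[str]


def first_flawed_step(
    steps: Sequence[str],
    prm_score: Callable[[Sequence[str], str], float],
    threshold: float,
) -> Optional[int]:
    """Return the index of the first step the PRM scores below threshold."""
    for i, step in enumerate(steps):
        if prm_score(steps[:i], step) < threshold:
            return i
    return None


def refine(
    question: str,
    steps: Sequence[str],
    prm_score: Callable[[Sequence[str], str], float],
    propose: Callable[[str, Steps], List[Steps]],
    orm_score: Callable[[str, Steps], float],
    threshold: float = 0.5,  # hypothetical PRM cut-off
    max_rounds: int = 3,     # hypothetical iteration budget
) -> Steps:
    """Repair a reasoning trace, conditioning each repair on the verified prefix."""
    current = list(steps)
    for _ in range(max_rounds):
        bad = first_flawed_step(current, prm_score, threshold)
        if bad is None:
            break  # every step passes the PRM, so stop early
        # State carried forward: only steps that already passed the PRM.
        verified = current[:bad]
        # Each candidate is a continuation (list of steps) from the verified prefix.
        candidates = propose(question, verified)
        best_tail = max(candidates, key=lambda tail: orm_score(question, verified + tail))
        current = verified + best_tail
    return current
```

The point of the sketch is the `verified` prefix: each round regenerates only from the first flawed step onward, so earlier accepted steps are preserved rather than re-derived from scratch, which is the stateful behavior the caption contrasts with stateless refinement.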