CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute

Chen Jin; Ryutaro Tanno; Tom Diethe; Philip Teare

TL;DR

CoRefine introduces confidence-guided self-refinement for adaptive, token-efficient reasoning: a lightweight Conv1D controller reads the full-trace, token-level confidence of a reasoning chain and chooses among HALT, RETHINK, and ALTERNATIVE actions. By treating confidence as a control signal rather than a correctness estimate, the method achieves comparable or better accuracy than large-scale parallel sampling while using roughly 190× fewer tokens and delivering substantial wall-clock speedups. The approach is validated across multiple open-source models and diverse math benchmarks, with strong results in both standard and regulated-domain (BixBench) settings, and is extended with a CoRefine Tree variant for hybrid sequential-parallel reasoning. The work provides a modular, generalizable primitive for scalable reasoning and agentic systems with imperfect verifiers, supporting targeted refinement and safe halting decisions in practical deployments.
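The control loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: `Trace`, `refine`, `threshold_controller`, the prompt strings, and every threshold are hypothetical, and the learned Conv1D controller is replaced by a simple rule on late-trace confidence so that the example stays self-contained.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    HALT = "halt"                # accept the current answer
    RETHINK = "rethink"          # verify the existing reasoning
    ALTERNATIVE = "alternative"  # explore a new approach


@dataclass
class Trace:
    answer: str
    confidences: List[float]  # one confidence value per generated token


# Hypothetical follow-up instructions; the source does not give the prompts.
FOLLOW_UP = {
    Action.RETHINK: "Re-examine the reasoning above and verify each step.",
    Action.ALTERNATIVE: "Solve the problem again using a different approach.",
}


def threshold_controller(confidences: List[float],
                         halt_at: float = 0.8,
                         rethink_at: float = 0.5) -> Action:
    """Stand-in for the learned controller (assumption: thresholds on the
    mean confidence of the last quarter of the trace)."""
    if not confidences:
        return Action.ALTERNATIVE
    tail = confidences[-max(1, len(confidences) // 4):]
    late = sum(tail) / len(tail)
    if late >= halt_at:
        return Action.HALT
    if late >= rethink_at:
        return Action.RETHINK
    return Action.ALTERNATIVE


def refine(problem: str,
           generate: Callable[[str], Trace],
           controller: Callable[[List[float]], Action] = threshold_controller,
           max_steps: int = 8) -> Tuple[str, List[Action]]:
    """Generate, ask the controller for an action, and repeat until HALT
    or until the step budget is exhausted."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    history: List[Action] = []
    prompt = problem
    for _ in range(max_steps):
        trace = generate(prompt)
        action = controller(trace.confidences)
        history.append(action)
        if action is Action.HALT:
            break
        prompt = f"{problem}\n\nPrevious answer: {trace.answer}\n{FOLLOW_UP[action]}"
    return trace.answer, history


# Usage with a scripted generator standing in for the frozen LLM.
scripted = iter([
    Trace("7", [0.9, 0.4, 0.3, 0.2]),
    Trace("12", [0.6, 0.6, 0.6, 0.6]),
    Trace("12", [0.7, 0.8, 0.9, 0.95]),
])
answer, history = refine("toy problem", lambda prompt: next(scripted))
print(answer, [a.name for a in history])
```

Passing the controller in as a plain callable keeps the loop independent of how the decision is made, so a trained model could be swapped in for the threshold rule without touching `refine`.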

Abstract

Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D controller atop a frozen LLM. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach, enabling targeted self-correction with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction relative to 512-sample baselines. Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6 percent precision when it confidently halts, indicating that confidence dynamics reliably signal correctness without ground-truth verification. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation, with easy serving integration and verifier compatibility. By treating confidence as a control signal rather than a correctness guarantee, CoRefine provides a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.
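The two headline numbers in the abstract are roughly consistent with each other under a simple cost model. The sketch below assumes, and the source does not state this, that each refinement step costs about as many tokens as one parallel sample. The function name `token_reduction` is illustrative only.

```python
def token_reduction(parallel_samples: int, avg_sequential_traces: float) -> float:
    """Ratio of parallel to sequential token cost.

    Assumption: every trace (a parallel sample or a refinement step)
    costs about the same number of tokens.
    """
    if avg_sequential_traces <= 0:
        raise ValueError("avg_sequential_traces must be positive")
    return parallel_samples / avg_sequential_traces


# 512-sample baseline versus an average of 2.7 refinement steps per problem.
print(round(token_reduction(512, 2.7), 1))  # prints 189.6
```

Under that assumption the ratio lands close to the reported figure of roughly 190; actual savings depend on per-trace lengths, which this arithmetic ignores.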

Paper Structure (160 sections, 12 equations, 18 figures, 12 tables, 2 algorithms)

Introduction
Confidence as a Control Signal
Token-Level Confidence Extraction
Confidence Distributions: Correct vs. Incorrect
Key insight: Control vs. Estimation.
From Confidence to Control Actions
Confidence-Guided Self-Refine (CoRefine)
System Overview
Confidence Feature Extraction
Temporal Downsampling.
Why Not Text Features?
Neural Controller Architecture
Training.
Oracle Label Generation.
Theoretical Justification.
...and 145 more sections

Figures (18)

Figure 1: Top: Token efficiency versus accuracy across four reasoning benchmarks: AIME24, AIME25, BRUMO25 and HMMT25. CoRefine achieves competitive or superior accuracy to 512-sample or 20-sample majority voting with $\sim$190$\times$ fewer tokens. The wall-clock time versus accuracy plots show that token savings translate into actual latency reduction, with CoRefine saving up to 63% over parallel baselines. Bottom: Confidence-Guided Self-Refine overview. The controller consumes full-trace confidence features of the LLM-decoded reasoning chain and decides: HALT (accept current answer), RETHINK (verify reasoning), or ALTERNATIVE (explore new approach).
Figure 2: Averaged confidence evolution for correct vs. incorrect reasoning traces. Left: DeepSeek-R1-8B (12,060 traces). Right: Qwen3-32B (8,354 traces). Both models show correct traces maintaining higher late-phase confidence, but with distinct dynamics: DeepSeek exhibits increasing confidence for correct traces with a sharp terminal spike, while Qwen3 shows globally descending confidence for both classes.
Figure 3: (a) DeepConf - Parallel: Sample $K$ traces, filter by confidence, aggregate via weighted voting. (b) CoRefine - Sequential: Iteratively refine using controller decisions based on full-trace confidence. (c) CoRefine Tree - Hybrid: Combine parallel sampling with sequential refinement for best of both paradigms.
Figure 4: CoRefine Tree visualization on HMMT 2025 Q13 (Sophie's coordinate grid paths). Each node shows the model's answer and confidence; edge colors indicate controller decisions (green=HALT, red=RETHINK, orange=ALTERNATIVE). The controller achieves 100% precision: it HALTs only on the correct answer (2304) while correctly refining all 14 incorrect answers. This "zero false HALT" property (never stopping on wrong answers) is the controller's most critical safety guarantee.
Figure 5: Controller action distribution across benchmarks. The controller balances HALT (green), RETHINK (red), and ALTERNATIVE (orange) decisions based on confidence patterns. HALT rates are highest on BRUMO25 and AIME24 (66% and 64%), reflecting these benchmarks' higher tractability. HMMT25 shows the lowest HALT rate (48%) and highest RETHINK rate (37%), indicating the controller appropriately allocates more exploration effort to harder problems.
...and 13 more figures
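The parallel baseline in the Figure 3(a) caption (sample $K$ traces, filter by confidence, aggregate via weighted voting) can be illustrated with a short sketch. This is an assumption-laden toy, not DeepConf's actual procedure: the function name `confidence_weighted_vote`, the `keep_fraction` parameter, and the use of a single scalar confidence per trace as the vote weight are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def confidence_weighted_vote(samples: List[Tuple[str, float]],
                             keep_fraction: float = 0.5) -> str:
    """Keep the most confident traces, then sum confidence per answer.

    samples: (answer, trace_confidence) pairs, one per sampled trace.
    keep_fraction: share of traces kept after filtering (hypothetical knob).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    scores: Dict[str, float] = defaultdict(float)
    for answer, confidence in ranked[:keep]:
        scores[answer] += confidence  # each vote is weighted by its confidence
    return max(scores, key=scores.get)


samples = [("A", 0.9), ("B", 0.6), ("B", 0.55), ("C", 0.2)]
print(confidence_weighted_vote(samples, keep_fraction=0.75))  # prints B
```

In this toy, two moderately confident traces agreeing on "B" outweigh a single highly confident "A" once the weakest trace is filtered out, which is the kind of aggregation the caption describes.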
