Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu

TL;DR

Guided by Gut (GG) is a self-guided Test-Time Scaling (TTS) framework that relies on intrinsic LLM signals (token-level confidence and novelty), with a reinforcement-learning fine-tuning phase that calibrates these signals. It replaces costly external verifiers with a lightweight tree search (DVTS) guided by the intrinsic rewards, enabling small LLMs to match or exceed the performance of much larger models on challenging mathematical benchmarks while reducing GPU memory and KV-cache usage. Compared to Best-of-N and PRM-based approaches, GG achieves competitive accuracy with substantially faster inference and lower memory demands, making practical deployment of TTS more feasible. The approach shows strong empirical gains on the AIME, AMC, and MATH benchmarks and offers a scalable path toward efficient, locally deployable reasoning LLMs.

Abstract

Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals: token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.
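
To make the idea of search guided by intrinsic signals concrete, here is a minimal Python sketch of scoring candidate reasoning steps without an external verifier. It is illustrative only: the abstract does not give the paper's formulas, so the definitions below (confidence as the geometric-mean token probability, novelty as the share of previously unseen tokens, and the `novelty_weight` parameter) are assumptions, and the function names are hypothetical.

```python
import math


def step_confidence(token_logprobs):
    """Geometric-mean token probability of one step (assumed definition)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def step_novelty(step_tokens, seen_tokens):
    """Share of distinct tokens in the step that earlier steps lack (assumed)."""
    distinct = set(step_tokens)
    if not distinct:
        return 0.0
    return len(distinct - set(seen_tokens)) / len(distinct)


def pick_step(candidates, seen_tokens, novelty_weight=0.1):
    """Return the index of the candidate step with the highest combined score.

    Each candidate is a dict with "tokens" (list of str) and "logprobs"
    (list of float, one log-probability per generated token).
    novelty_weight is a hypothetical mixing coefficient, not a paper value.
    """
    scores = [
        step_confidence(c["logprobs"])
        + novelty_weight * step_novelty(c["tokens"], seen_tokens)
        for c in candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: two candidate next steps; the first is generated with higher confidence.
candidates = [
    {"tokens": ["x", "=", "2"], "logprobs": [-0.1, -0.1, -0.1]},
    {"tokens": ["x", "=", "3"], "logprobs": [-2.0, -2.0, -2.0]},
]
best = pick_step(candidates, seen_tokens=["x", "="])
```

Because both signals come from the generating model's own output, no second model has to be loaded, which is consistent with the memory savings the abstract reports relative to PRM-based search.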

Paper Structure

This paper contains 28 sections, 9 equations, 3 figures, 7 tables, and 1 algorithm.

Figures (3)

  • Figure 1: We compare the performance and GPU VRAM usage of Guided by Gut (GG; stars) to Best-of-N (BoN; triangles) and Chain-of-Thought (CoT; circles) on several LLMs. GG achieves better accuracy at much lower memory cost (log-scaled).
  • Figure 2: Comparison of reasoning generation strategies. (1) Standard Chain-of-Thought (CoT) generates a single reasoning path autoregressively. (2) Search guided by an external Process Reward Model (PRM) explores multiple candidate steps ($s_1^t, s_2^t, \dots$), using PRM scores to select promising paths. (3) Our proposed Self-Guided Search similarly explores multiple steps but uses intrinsic signals, Confidence ($\mathcal{C}$) and Novelty ($N$), derived from the LLM to guide the search at each step without relying on an external PRM. In the example, $\mathbf{T}$ stands for an independent tree and $\mathbf{B}$ stands for a branch within that tree. Example text best read zoomed-in.
  • Figure 3: Answer Confidence Distribution Across Training Settings. Each subplot shows the normalized distribution of confidence scores for correct (green) and incorrect (orange) completions across different fine-tuning strategies. The vertical dashed lines mark the mean confidence for correct and incorrect completions, respectively. The base model (left) is generally overconfident, with incorrect completions receiving high confidence scores. Fine-tuning with correctness reward (middle) improves accuracy but leaves the confidence distribution largely unchanged. Confidence-based fine-tuning (right) better separates correct from incorrect completions, showing improved calibration.
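
The separation described in the Figure 3 caption can be summarized by a single number: the distance between the two dashed lines, i.e. mean confidence on correct completions minus mean confidence on incorrect ones. The sketch below is one simple way to compute it; the function name and the sample values are hypothetical and are not data from the paper.

```python
from statistics import mean


def confidence_gap(confidences, is_correct):
    """Mean confidence of correct completions minus that of incorrect ones.

    A larger positive gap means confidence separates right answers from
    wrong ones more cleanly; a gap near zero suggests overconfidence on
    wrong answers.
    """
    correct = [c for c, ok in zip(confidences, is_correct) if ok]
    wrong = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not correct or not wrong:
        raise ValueError("need at least one correct and one incorrect completion")
    return mean(correct) - mean(wrong)


# Usage with made-up scores: two correct and two incorrect completions.
gap = confidence_gap([0.9, 0.8, 0.4, 0.2], [True, True, False, False])
```

Under this measure, an overconfident model like the base model in the left panel would show a small gap, while a better-calibrated one like the right panel would show a larger gap.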