Thought calibration: Efficient and confident test-time scaling

Menghua Wu, Cai Zhou, Stephen Bates, Tommi Jaakkola

TL;DR

This work tackles the compute cost of long chain-of-thought reasoning in language models by introducing thought calibration, a dynamic stopping rule guided by a reasoning graph framework. It leverages lightweight probes trained on hidden representations within a Learn then Test paradigm to provide calibrated risk control for terminating thinking early. Across three reasoning models and four datasets, thought calibration achieves up to a 60% reduction in thinking tokens in-distribution and up to 20% out-of-distribution without sacrificing accuracy, with consistency-based probes often offering better generalization. Limitations include reliance on calibration data similarity and linear probes, highlighting opportunities for richer probing and broader control of reasoning trajectories in future work.

Abstract

Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model's hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.
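
The stopping rule described above can be illustrated with a minimal sketch. It assumes a logistic-regression probe applied to one hidden-state vector per reasoning step; the names `probe_probability` and `first_stopping_step`, the per-step granularity, and the toy weights are hypothetical and not taken from the paper. The threshold is treated as given here; in the paper it is chosen by a calibration procedure.

```python
import math
from typing import Optional, Sequence


def probe_probability(hidden: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> float:
    """Linear probe with a sigmoid output: sigmoid(w . h + b)."""
    score = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def first_stopping_step(step_hiddens: Sequence[Sequence[float]],
                        weights: Sequence[float],
                        bias: float,
                        threshold: float) -> Optional[int]:
    """Return the index of the first step whose probe output reaches the
    threshold, or None if the probe never becomes confident enough."""
    for step, hidden in enumerate(step_hiddens):
        if probe_probability(hidden, weights, bias) >= threshold:
            return step
    return None


# Toy hidden states (assumed values, for illustration only).
toy_hiddens = [[0.0, 0.0], [2.0, 0.0], [5.0, 0.0]]
toy_weights = [1.0, -1.0]
stop_at = first_stopping_step(toy_hiddens, toy_weights, bias=0.0, threshold=0.9)
```

With a fixed token budget every problem is cut at the same length, whereas a rule of this shape stops each problem at its own step, which is the behavior the abstract contrasts with directly limiting the test-time budget.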

Paper Structure

This paper contains 25 sections, 1 theorem, 7 equations, 6 figures, 1 table.

Key Result

Theorem 3.4

Suppose $p_j$ is super-uniform under $H_j$ for all $j$. Let $\mathcal{A}$ be an algorithm that controls the family-wise error rate (FWER) at level $\epsilon$. Then $\Lambda_{\text{valid}} = \mathcal{A}(p_1,\dots,p_m)$ satisfies Equation eq:realistic-answer.
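
One way to instantiate the theorem is sketched below, under assumptions that this summary does not state. Each $H_j$ is taken to be the hypothesis that the risk at candidate threshold $\lambda_j$ exceeds a tolerance (called `risk_tolerance` here, a hypothetical name), losses on the calibration set are assumed binary and i.i.d., the p-value is an exact binomial tail, and $\mathcal{A}$ is the Bonferroni correction. The paper may use a different p-value or a different FWER-controlling algorithm.

```python
from math import comb
from typing import Dict, List, Sequence


def binomial_p_value(num_errors: int, n: int, risk_tolerance: float) -> float:
    """P(Binomial(n, risk_tolerance) <= num_errors).

    For binary i.i.d. losses this is super-uniform under the null
    hypothesis that the true risk exceeds risk_tolerance.
    """
    return sum(
        comb(n, k) * risk_tolerance**k * (1.0 - risk_tolerance) ** (n - k)
        for k in range(num_errors + 1)
    )


def bonferroni_valid_thresholds(
    errors_by_threshold: Dict[float, Sequence[int]],
    risk_tolerance: float,
    epsilon: float,
) -> List[float]:
    """Return thresholds whose null hypothesis is rejected by Bonferroni,
    which controls the family-wise error rate at level epsilon."""
    m = len(errors_by_threshold)
    valid = []
    for lam, errors in errors_by_threshold.items():
        p_j = binomial_p_value(sum(errors), len(errors), risk_tolerance)
        if p_j <= epsilon / m:
            valid.append(lam)
    return sorted(valid)


# Toy calibration data (assumed): 1 marks an example where stopping early
# at that threshold changed the outcome, 0 marks one where it did not.
calibration_errors = {
    0.9: [0] * 20,            # no errors in 20 examples
    0.5: [1] * 5 + [0] * 15,  # 5 errors in 20 examples
}
valid = bonferroni_valid_thresholds(
    calibration_errors, risk_tolerance=0.2, epsilon=0.1
)
```

Bonferroni is the simplest FWER-controlling choice and is conservative when many thresholds are tested; any other procedure with the same guarantee could be substituted for it in the role of $\mathcal{A}$.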

Figures (6)

  • Figure 1: Overview of the problem and our goal. Illustrated example based on s1K-1.1 (muennighoff2025s1).
  • Figure 2: On in-distribution data (held-out test split on s1K), variants of thought calibration achieve up to a 60% reduction in thinking tokens while maintaining full performance. Top right point: Complete DeepSeek-R1 thought trajectory from muennighoff2025s1. Crop: Fix thinking budget at 512, 1024, 2048, 4096, and 8192 tokens. Supervised: exit based on predicted likelihood of correctness. Consistent and Leaf Novelty: exit based on predicted consistency of answer or graph. Supervised is overconfident, since the test set contains unsolvable problems.
  • Figure 3: We applied thought calibration probes for DeepSeek-distilled Qwen-2.5 32B on standard math and science benchmarks, which may be out-of-distribution compared to the training and calibration sets, drawn from s1K. We achieve up to a 20% reduction in thinking tokens. While Consistent generally remains below the predetermined error rates, Supervised is overconfident (as expected).
  • Figure 4: Proportion of prompt tokens removed, for different thresholds, stratified by full thought length and whether the original model was able to solve the problem. Top: Naive max token thresholding. Bottom: Consistency calibration, DeepSeek-R1 distilled Qwen 32B, over GPQA Diamond. Cropping reduces token lengths uniformly, regardless of the input characteristics. Thought calibration has a preference for first trimming longer thoughts and cases where the language model tries but fails to make progress.
  • Figure 5: DeepSeek-R1 distilled Llama 70B Consistency probe on s1K-1.1 example from our test split, where color intensity is proportional to $\mathbb{P}(\text{consistent})$. The language model first reaches the correct answer in Step 38, backtracks with lower confidence, and returns to the answer in Step 41.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Theorem 3.4: Adapted from Theorem 1 in angelopoulos2021learn