Trace Length is a Simple Uncertainty Signal in Reasoning Models

Siddartha Devic, Charlotte Peale, Arwen Bradley, Sinead Williamson, Preetum Nakkiran, Aravind Gollakota

TL;DR

This paper addresses uncertainty quantification in large reasoning models by showing that the length of a reasoning trace, the trace length (TL), is a simple zero-shot confidence signal that becomes informative after reasoning post-training. Across multiple 32B and 7B models and ten diverse datasets, TL performs comparably to verbalized confidence (VC) and carries complementary information: a simple VC+TL combination often yields the best discrimination (AUROC). The authors investigate the mechanisms behind TL's emergence, notably its strong association with high-entropy forking tokens, and show that TL remains useful even when the prompt is held fixed or when length-bias corrections are applied to GRPO-based training. The findings offer a practical baseline for uncertainty estimation in black-box reasoning models and point to forking-token dynamics as a fruitful area for further study. Overall, TL is a lightweight, zero-shot UQ signal that requires no extra prompting and can help detect incorrect model generations in real-world deployments.

Abstract

Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work that had shown that post-training causes traces to grow longer in general (e.g., "overthinking"). We investigate the mechanisms behind trace length's performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or "forking" tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.
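To make the idea of scoring a confidence estimator concrete, here is a minimal Python sketch of how trace length could be evaluated by AUROC. It assumes that shorter traces signal higher confidence (a direction this summary does not state explicitly), and the helper names and toy data are hypothetical, not taken from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise (Mann-Whitney) AUROC.

    Returns the probability that a randomly chosen correct answer has a
    higher confidence score than a randomly chosen incorrect one, with
    ties counted as one half.
    """
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def trace_length_confidence(trace_lengths: Sequence[int]) -> list[float]:
    """Assumption: shorter traces are more confident, so negate the length."""
    return [-float(n) for n in trace_lengths]


# Hypothetical toy data: token counts of five traces and whether each answer was correct.
lengths = [120, 950, 300, 2100, 410]
is_correct = [True, False, True, False, True]
tl_auroc = auroc(trace_length_confidence(lengths), is_correct)
```

An AUROC of 0.5 corresponds to a signal that ranks correct and incorrect answers no better than chance; values closer to 1 mean the signal separates them well. The quadratic loop is fine for a sketch, though a rank-based implementation would be preferable on large evaluation sets.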

Paper Structure

This paper contains 36 sections, 4 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Reasoning post-training improves both verbalized confidence and trace length as uncertainty signals. Scatter plot showing verbalized confidence and trace length performance in terms of AUROC. Each point represents a different dataset, and arrows connect the same dataset before and after reasoning post-training. Two reasoning models, iw-SFT-32B [qin2025supervised] and OpenThinker2-32B [guha2025openthoughtsdatarecipesreasoning], have better verbalized confidence performance than their base model Qwen2.5-32B-Instruct on many datasets (using the numeric-prompt listing). However, trace length also emerges as a powerful uncertainty signal after post-training, with comparable power in predicting whether the response is correct.
  • Figure 2: AUROC performance of verbalized confidence (VC), trace length (TL), and their zero-shot sum VC+TL. (Left): Performance for the 32B base model Qwen2.5-Instruct and four 32B reasoning models post-trained from Qwen2.5 (see the models appendix for model details). Results are averaged over ten datasets (datasets appendix) and three prompts (prompts appendix) per model. After reasoning post-training: (1) trace length emerges as a reliable uncertainty signal, competitive with verbalized confidence; and (2) summing VC and TL almost always outperforms both TL and VC individually. (Right): We find similar results for two 7B reasoning models post-trained from Qwen2.5-7B.
  • Figure 3: Verbalized confidence (VC) and trace length (TL) are only loosely correlated. (Left): Distribution over ten datasets of Spearman correlations between VC and TL per 32B model, demonstrating that the two quantities are correlated but not perfectly so (using the linguistic-prompt listing). (Middle & Right): Heatmap of average correctness (middle) and sample density (right) for OpenThinker2-32B using the numeric-prompt listing over all datasets, split by VC and TL. The upper right quadrant of the middle heatmap demonstrates that using only VC or TL as an uncertainty measure in isolation (e.g., choosing a horizontal or vertical threshold) will not outperform using VC + TL.
  • Figure 4: Trace length strongly correlates with the number of forking tokens in reasoning models. (Left): For each 32B model, the box plot shows the distribution, across ten datasets, of Spearman correlation values between trace length (in tokens) and the count of the top 50 highest-entropy "forking" tokens. Very high correlation is observed for the two reasoning models compared to the base model. (Right): Table showing the AUROC of trace length (TL), the top 50 highest-entropy forking tokens (FT), the normalized sum TL+FT, and sequence probability (SP) for OpenThinker2-32B across ten datasets. We also include the AUROC of the best single forking token (BFT) over the dataset, and the BestToken itself. Generation details and additional tables are in the appendix with additional forking-token plots.
  • Figure 5: High-entropy tokens help quantify uncertainty in 32B models. In each plot, we show the AUROC of the uncertainty score that counts the occurrences of any of the $k$ highest-entropy forking tokens in each trace. As we increase $k$, the AUROC of the uncertainty score improves (see the forking-token section for details). The AUROC of trace length and of sequence probability are displayed for reference. Additional plots per model and dataset are available in the appendix with additional forking-token plots.
  • ...and 13 more figures
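
The captions above mention two derived scores: the zero-shot sum VC+TL and a count of high-entropy forking tokens per trace. The following is a minimal Python sketch of both under stated assumptions. The min-max normalization, the negation of trace length, the function names, and the example token set are all illustrative choices, not the paper's procedure.

```python
from typing import Iterable, Sequence


def min_max(values: Sequence[float]) -> list[float]:
    """Rescale values to [0, 1]; a constant input maps to 0.5 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def vc_plus_tl(verbal_conf: Sequence[float], trace_lengths: Sequence[int]) -> list[float]:
    """Sum of normalized verbalized confidence and normalized negative trace length.

    Assumption: shorter traces mean higher confidence, and both signals are
    min-max normalized over the evaluation set before summing.
    """
    vc = min_max(list(verbal_conf))
    tl = min_max([-float(n) for n in trace_lengths])
    return [a + b for a, b in zip(vc, tl)]


def forking_token_count(trace_tokens: Iterable[str], forking_tokens: set[str]) -> int:
    """Count how many tokens in a trace belong to a given forking-token set.

    A higher count is read as higher uncertainty, so negate it to use it
    as a confidence score.
    """
    return sum(1 for tok in trace_tokens if tok in forking_tokens)


# Hypothetical toy inputs.
combined = vc_plus_tl([0.9, 0.9, 0.6], [100, 500, 300])
n_forks = forking_token_count(
    ["Wait", ",", "so", "x", "=", "2", ".", "But", "wait"],
    {"Wait", "But", "so"},  # illustrative set, not the paper's top-k list
)
```

In this toy example the first two answers share the same verbalized confidence, and the trace-length term is what separates them in the combined score. That is the sense in which the two signals can be complementary.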

Theorems & Definitions (4)

  • Definition B.1: Accuracy
  • Definition B.2: Brier Score
  • Definition B.4: The Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUROC)
  • Remark B.1
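
For reference, the quantities named above are commonly written as follows. These are the standard textbook forms, given here as a sketch; the paper's own definitions may use different notation. The symbols are assumptions of this sketch: $y_i \in \{0, 1\}$ marks whether answer $i$ is correct, $c_i \in [0, 1]$ is the confidence assigned to it, and $n$ is the number of answers.

```latex
\begin{align*}
\text{Accuracy} &= \frac{1}{n} \sum_{i=1}^{n} y_i \\
\text{Brier score} &= \frac{1}{n} \sum_{i=1}^{n} (c_i - y_i)^2 \\
\text{AUROC} &= \Pr\left(c_i > c_j \mid y_i = 1,\, y_j = 0\right)
  + \tfrac{1}{2} \Pr\left(c_i = c_j \mid y_i = 1,\, y_j = 0\right)
\end{align*}
```

AUROC depends only on how the confidences rank the answers, so it applies directly to unbounded scores such as a negated trace length. The Brier score additionally requires confidences that behave like probabilities.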