Trace Length is a Simple Uncertainty Signal in Reasoning Models
Siddartha Devic, Charlotte Peale, Arwen Bradley, Sinead Williamson, Preetum Nakkiran, Aravind Gollakota
TL;DR
This paper addresses uncertainty quantification (UQ) in large reasoning models by showing that the length of a reasoning trace, or trace length (TL), serves as a simple zero-shot confidence signal that becomes informative after reasoning post-training. Across multiple 32B and 7B models and ten diverse datasets, TL performs comparably to verbalized confidence (VC) and provides complementary information, with a simple VC+TL combination often yielding the best discrimination (AUROC). The authors investigate the mechanisms behind TL's emergence, notably its strong association with high-entropy "forking" tokens, and demonstrate that TL's usefulness persists even when prompts are unchanged or length-bias corrections are applied to GRPO-based training. The findings offer a practical baseline for uncertainty estimation in black-box reasoning models and point to forking-token dynamics as a fruitful area for further study. Overall, TL provides a lightweight, prompt-free, zero-shot UQ signal that can improve reliability and help detect incorrect model generations in real-world deployments.
Abstract
Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs comparably to other zero-shot confidence estimators such as verbalized confidence, while providing complementary information. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work showing that post-training causes traces to grow longer in general (e.g., "overthinking"). We investigate the mechanisms behind trace length's performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or "forking" tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.
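As a rough illustration of how trace length could be scored as a confidence signal, the sketch below computes AUROC for a TL-based score, for verbalized confidence, and for a simple combination of the two. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes shorter traces indicate higher confidence (so the score is the negated length), and the rank-averaging used to combine VC and TL is a hypothetical choice, since the text above does not specify the combination rule. The function names and the toy numbers are made up for illustration and are not results from the paper.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random correct answer (label 1) receives a higher
    score than a random incorrect one (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tl_confidence(trace_lengths: Sequence[float]) -> List[float]:
    """Assumption: shorter traces mean higher confidence, so negate the length."""
    return [-float(t) for t in trace_lengths]


def rank_normalize(values: Sequence[float]) -> List[float]:
    """Map values to [0, 1] by average rank so signals on different scales
    (token counts vs. stated probabilities) can be averaged."""
    n = len(values)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / (n - 1)
        i = j + 1
    return ranks


def combined_confidence(vc: Sequence[float],
                        trace_lengths: Sequence[float]) -> List[float]:
    """Hypothetical VC+TL combination: mean of the two rank-normalized signals."""
    vc_rank = rank_normalize(vc)
    tl_rank = rank_normalize(tl_confidence(trace_lengths))
    return [(a + b) / 2.0 for a, b in zip(vc_rank, tl_rank)]


# Toy data (invented): trace lengths in tokens, verbalized confidence in [0, 1],
# and whether each final answer was correct.
toy_lengths = [120, 800, 150, 900, 700, 200]
toy_vc = [0.9, 0.9, 0.6, 0.8, 0.5, 0.7]
toy_correct = [1, 1, 1, 0, 0, 1]

print("AUROC, TL only:", auroc(tl_confidence(toy_lengths), toy_correct))
print("AUROC, VC only:", auroc(toy_vc, toy_correct))
print("AUROC, VC+TL:  ", auroc(combined_confidence(toy_vc, toy_lengths), toy_correct))
```

Because AUROC depends only on how examples are ranked, the TL score needs no calibration or rescaling on its own; rank normalization matters only when mixing it with another signal such as VC.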
