Reasoning Models Better Express Their Confidence

Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, Minjoon Seo

TL;DR

The paper shows that reasoning models employing extended chain-of-thought (CoT) not only solve problems effectively but also express their confidence more accurately than non-reasoning counterparts. Through benchmarking six reasoning models across six datasets, it finds calibration gains in 33 of 36 settings, driven by slow-thinking behaviors that allow dynamic confidence updates during CoT. Ablation experiments demonstrate that non-linear reasoning, alternative exploration, and backtracking contribute to improved calibration, while even non-reasoning models gain calibration when prompted to slow-think via in-context learning. These findings highlight slow thinking as a key mechanism for producing trustworthy, uncertainty-aware LLMs and suggest practical prompts to improve calibration across diverse tasks.

Abstract

Despite their strengths, large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models that engage in extended chain-of-thought (CoT) reasoning exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models (e.g., exploring alternative approaches and backtracking) which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that non-reasoning models also demonstrate enhanced calibration when simply guided to slow think via in-context learning, fully isolating slow thinking as the source of the calibration gains.

Paper Structure

This paper contains 37 sections, 5 figures, 12 tables.

Figures (5)

  • Figure 1: R1-Distill-Qwen-32B dynamically refines its confidence throughout CoT (left) as it engages in various slow thinking behaviors (right). We collect the model’s answer and confidence at each token position by appending "</think>\nAnswer:" to terminate the reasoning process early. Note that the model's answer is consistent and correct ("Cleisthenes") at all points, while the confidence fluctuates. For visual clarity, the data was smoothed using a Butterworth low-pass filter. See the appendix for the full untruncated CoT.
  • Figure 2: Accuracy (left) and sample frequency (right) across confidence bins for Qwen2.5-32B-Instruct and R1-Distill-Qwen-32B on TriviaQA.
  • Figure 3: Relative change in Brier Score as CoT progresses on NonambigQA. Non-reasoning models are represented by triangles ($\triangle$), and reasoning models by circles ($\bullet$), with each model pair shown in matching colors.
  • Figure 4: Relative change in Brier Score on NonambigQA under budget forcing (left) and across different model scales (right).
  • Figure 5: Full CoT version of Figure 1.
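
The quantities named in the captions for Figures 2 and 3 (accuracy per confidence bin, Brier Score, and its relative change as CoT progresses) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names, the ten equal-width bins, and the definition of relative change as (current - baseline) / baseline are assumptions made here for illustration, since the summary above does not state them.

```python
from typing import List, Optional, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence (0..1) and correctness (0 or 1).

    Lower is better; 0.0 means every confidence matched the outcome exactly.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def relative_change(baseline: float, current: float) -> float:
    """Assumed definition: fractional change of a score against a baseline.

    Negative values mean the Brier Score dropped, i.e. calibration improved.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def accuracy_by_bin(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from the paper
) -> List[Tuple[int, Optional[float]]]:
    """Return (sample count, accuracy) per equal-width confidence bin.

    Accuracy is None for empty bins. A confidence of exactly 1.0 falls
    into the last bin.
    """
    counts = [0] * n_bins
    hits = [0] * n_bins
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        counts[idx] += 1
        hits[idx] += int(y)
    return [
        (n, (h / n) if n else None)
        for n, h in zip(counts, hits)
    ]


if __name__ == "__main__":
    # Hypothetical confidences collected early and late in a CoT.
    labels = [True, False, True, True]
    early = [0.9, 0.9, 0.9, 0.9]
    late = [0.9, 0.3, 0.8, 0.9]
    b_early = brier_score(early, labels)
    b_late = brier_score(late, labels)
    print("Brier early:", b_early)
    print("Brier late:", b_late)
    print("relative change:", relative_change(b_early, b_late))
```

Under these assumptions, a curve that slopes downward in a plot like Figure 3 corresponds to `relative_change` becoming more negative as more of the CoT is included before the answer is forced.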