Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs

Jakub Podolak, Rajeev Verma

TL;DR

The paper investigates uncertainty estimation in large language models by comparing Verbalized Confidence (VC) and Semantic Entropy (SE) under matched test-time budgets. It shows that VC is severely over-confident without reasoning, while SE remains well-calibrated due to explicit exploration of the predictive space; extending VC with longer reasoning or sampling brings its calibration close to SE. A reader analysis demonstrates that much of the confidence signal can be recovered from the reasoning trace itself, supporting the view that uncertainty emerges from surface-level exploration rather than an intrinsic latent state. The work concludes that reliable uncertainty estimation hinges on explicit exploration of the generative space, and that self-reported confidence becomes trustworthy primarily after such exploration, with implications for deploying uncertain outputs in critical settings.

Abstract

We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks. In the default answer-then-confidence setting, the model is regularly over-confident, whereas semantic entropy, obtained by sampling many responses, remains reliable. We hypothesize that this is because of semantic entropy's larger test-time compute, which lets us explore the model's predictive distribution. We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves the effectiveness of its verbal score, even on simple fact-retrieval questions that normally require no reasoning. Furthermore, a separate reader model that sees only the chain can reconstruct very similar confidences, indicating the verbal score might be merely a statistic of the alternatives surfaced during reasoning. Our analysis concludes that reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is trustworthy only after such exploration.

Paper Structure

This paper contains 20 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: DeepSeek R1-32B's Verbalized Confidence (VC) improves and matches the effectiveness of Semantic Entropy (SE) when longer reasoning is forced. Our work suggests that it is the test-time exploration of the model’s predictive space, not the particular uncertainty heuristic, that makes confidence estimates reliable.
  • Figure 2: Separate reader matches the reliability of DeepSeek’s own Verbalized Confidence by just looking at the reasoning trace. With more reasoning tokens, the agreement between them (measured as absolute Spearman correlation) increases, and the effectiveness of both scores changes similarly.
  • Figure 3: Effectiveness and Accuracy of Verbalized Confidence with Forced Reasoning vs Semantic Entropy. (a) Full overview. (b) Fact retrieval results. (c) Mathematical reasoning results. Note: The remaining 10 samples not falling into the Fact Retrieval or Mathematical Reasoning categories are included in the Full overview but not presented as separate plots.
  • Figure 4: Two tested methods of obtaining the Final Answer and Confidence. Verbalized Confidence with Forced Reasoning (VC) works by prompting the model to reason for longer, until the fixed budget is exhausted, before stating the answer and confidence. Semantic Entropy (SE) obtains 10 independent answers that are later clustered semantically to identify the most frequent one and to calculate the entropy of the answer distribution.
  • Figure 5: Composition of the data sample we used.
  • ...and 3 more figures
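
The Semantic Entropy procedure described in the Figure 4 caption (sample several answers, cluster them by meaning, take the most frequent cluster as the answer, and compute the entropy over clusters) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `same_meaning` is a hypothetical stand-in that compares normalized strings, whereas the paper clusters answers semantically by a method not given here, and the use of the natural logarithm is also an assumption.

```python
import math
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Lowercase, trim, drop trailing periods, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def same_meaning(a: str, b: str) -> bool:
    """Hypothetical equivalence check (assumption): exact match after
    normalization. A real system would use a semantic comparison instead."""
    return normalize(a) == normalize(b)


def semantic_entropy(
    answers: List[str],
    equivalent: Callable[[str, str], bool] = same_meaning,
) -> Tuple[str, float]:
    """Return (most frequent answer, entropy over answer clusters)."""
    if not answers:
        raise ValueError("at least one sampled answer is required")

    # Greedy clustering: put each answer into the first cluster whose
    # representative (its first member) is judged equivalent.
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            if equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])

    # Empirical cluster probabilities and their Shannon entropy (in nats).
    n = len(answers)
    probs = [len(c) / n for c in clusters]
    entropy = sum(-p * math.log(p) for p in probs)

    # The final answer is the representative of the largest cluster.
    majority = max(clusters, key=len)[0]
    return majority, entropy


if __name__ == "__main__":
    # Ten sampled answers, matching the count mentioned in the caption.
    samples = ["Paris"] * 5 + ["paris.", " Paris ", "PARIS"] + ["Lyon", "lyon"]
    answer, entropy = semantic_entropy(samples)
    print(answer, round(entropy, 4))
```

Lower entropy means the sampled answers agree more, so it can be read as higher confidence in the majority answer; an entropy of zero means every sample fell into a single cluster.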