Thinking Out Loud: Do Reasoning Models Know When They're Right?

Qingcheng Zeng, Weihao Xuan, Leyang Cui, Rob Voigt

TL;DR

The paper investigates whether verbalized confidence in large reasoning models aligns with actual correctness by comparing instruction tuning, SFT on long reasoning traces (distillation), and reinforcement learning for reasoning. Across math, factuality, and scientific/general reasoning tasks, reasoning-focused training improves accuracy and calibration on reasoning benchmarks, with RL offering additional cross-domain benefits, though small models may misjudge their knowledge boundaries on factual tasks. The results highlight a trade-off: enhanced reasoning ability and confidence can come with reduced faithfulness in certain domains, revealing a 'reasoning tax' that must be managed through reward design and evaluation. Overall, verbalized confidence serves as a natural interface for human-AI collaboration, but deployment demands careful calibration-focused training and robust assessment of knowledge boundaries and abstention behavior.

Abstract

Large reasoning models (LRMs) have recently demonstrated impressive capabilities in complex reasoning tasks by leveraging increased test-time computation and exhibiting behaviors reminiscent of human-like self-reflection. While LRMs show a clear capacity for valuable self-reflection, how this ability interacts with other model behaviors remains underexplored. We investigate this connection by analyzing verbalized confidence (how models articulate their certainty) as a lens into the nature of self-reflection in LRMs. We find that supervised fine-tuning on reasoning traces (i.e., distillation) and reinforcement learning can improve verbalized calibration in reasoning-intensive settings in a progressive, laddered fashion. However, our results also indicate that reasoning models may possess a diminished awareness of their own knowledge boundaries, as evidenced by significantly lower "I don't know" response rates on factuality benchmarks. Moreover, we examine the relationship between verbalized confidence and reasoning chains, finding that models tend to express higher confidence when providing shorter or less elaborate reasoning. Our findings highlight how reasoning-oriented training can enhance performance in reasoning-centric tasks while potentially incurring a "reasoning tax": in small-scale models, a reduced ability to accurately recognize the limits of their own knowledge. More broadly, our work showcases how this erosion of knowledge boundaries can compromise model faithfulness, as models grow more confident without a commensurate understanding of when they should abstain.
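
Calibration of verbalized confidence is commonly summarized with expected calibration error (ECE), the metric named in Figure 3. The following is a minimal sketch of one standard way to compute it, not the authors' evaluation code. It assumes each verbalized confidence has already been parsed into a number in [0, 1], and it uses 10 equal-width bins; both the function name `expected_calibration_error` and the bin count are illustrative assumptions, since the text here does not specify the binning scheme.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins; the source does not specify
) -> float:
    """Weighted average gap between stated confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins   # sum of confidences falling in each bin
    hits = [0] * n_bins         # number of correct answers in each bin
    count = [0] * n_bins        # number of predictions in each bin

    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Map confidence to a bin index; a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[b] / count[b]
        accuracy = hits[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical data: a model that says "90% sure" four times, right three times.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

In the hypothetical usage above, the stated confidence (0.9) exceeds the observed accuracy (0.75), so the result is about 0.15. Lower values indicate that stated confidence tracks correctness more closely; a model that grows more confident without becoming more accurate would see this number rise.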

Paper Structure

This paper contains 18 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An illustration of different pathways of LLM/LRM training; we compare the calibration performance of three key categories of models.
  • Figure 2: Verbalized confidence evaluation across various tasks, prompting strategies, and model types.
  • Figure 3: The relationship between accuracy and ECE on the SuperGPQA benchmark. Models closer to the upper left are better.
  • Figure 4: The relationship between length and confidence, calibration, and accuracy on GPQA-Diamond and SuperGPQA benchmarks.