The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models

Robert Welch, Emir Konuk, Kevin Smith

Abstract

Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model's own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.
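
To make the contrast between the two families of estimates concrete, the sketch below computes a likelihood-based confidence from answer-token log-probabilities and an agreement-based confidence from repeated samples. It is only an illustration under stated assumptions: the function names, the length normalisation (geometric mean of token probabilities), and the string normalisation used to compare answers are hypothetical choices, not definitions taken from the paper.

```python
import math
from collections import Counter


def answer_token_likelihood(token_logprobs):
    """Likelihood-based confidence for one generated answer.

    Assumption: confidence is the geometric mean of the answer-token
    probabilities, i.e. exp(mean log-probability). The paper may
    normalise differently.
    """
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_confidence(sampled_answers):
    """Agreement-based confidence from k independently sampled answers.

    Returns the majority answer and the fraction of samples that agree
    with it. Answers are compared after trimming and lower-casing,
    which is a simplifying assumption.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalised = [a.strip().lower() for a in sampled_answers]
    answer, count = Counter(normalised).most_common(1)[0]
    return answer, count / len(normalised)


if __name__ == "__main__":
    # Two answer tokens, each with probability 0.9 (hypothetical values).
    print(answer_token_likelihood([math.log(0.9), math.log(0.9)]))
    # Four sampled answers (hypothetical), three of which agree.
    print(consistency_confidence(["vine", "Vine", "vertical garden", "vine"]))
```

The first estimate depends on everything that precedes the answer tokens, including the reasoning trace, while the second depends only on whether independent samples reach the same answer.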

Paper Structure (41 sections, 9 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Example illustrating reasoning-induced overconfidence. Color intensity reflects answer-token likelihood, which increases as the reasoning trace converges. As the model generates a chain-of-thought explanation, the evolving reasoning increasingly constrains the answer tokens (“vertical garden”), making predictions overly confident, even when the prediction is incorrect (ground truth: vine). We refer to this effect as implicit answer conditioning: the model progressively shifts from grounding the answer in the image to reinforcing its own reasoning trace, which in turn degrades reliability of many uncertainty estimates.
  • Figure 2: Confidence shifts induced by CoT reasoning. Bars show the percentage of samples whose confidence increases (solid color), decreases (striped color), or remains unchanged (gray) when moving from no-CoT to CoT inference, grouped by correctness. ATL-based estimates systematically increase confidence, including samples that become or remain incorrect, indicating confidence inflation unrelated to correctness. Self-reported confidence (SRC) and Consistency do not exhibit systematic inflation.
  • Figure 3: Evidence for Implicit Answer Conditioning. (a) Aggregated Spearman correlations between reasoning length and confidence under CoT (Fisher z-transformed; 95% CI). Longer reasoning traces correspond to lower confidence. (b) Confidence as a function of how often the final answer appears in the reasoning trace. Repeated answer mentions substantially increase answer-token likelihood. (c–d) Across datasets, answer frequency positively correlates with ATL-based uncertainty estimates. This correlation persists even for incorrect predictions, indicating correctness-agnostic confidence inflation. In contrast, Consistency and SRC show little sensitivity to answer frequency.
  • Figure 4: Effect of sample size $k$ on uncertainty quality. Lower AUGRC is better, higher PRR is better.
  • Figure 5: Sample-level confidence shifts under Chain-of-Thought prompting for Qwen3-VL-32B-Instruct across different datasets. Each subfigure corresponds to a dataset and reports confidence changes conditioned on correctness transitions.
  • ...and 2 more figures
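
The Figure 3 caption mentions correlations aggregated with a Fisher z-transform and reported with 95% confidence intervals. A minimal sketch of one common way to do this is given below. The aggregation scheme is an assumption: the function name, the weights of n - 3 per dataset, and the standard error 1/sqrt(sum of weights) are the textbook fixed-effect choices for correlation coefficients, not details confirmed by the paper.

```python
import math


def aggregate_correlations(correlations, sample_sizes, z_crit=1.96):
    """Pool correlation coefficients via the Fisher z-transform.

    Assumptions (not taken from the paper): each coefficient is weighted
    by n - 3, and the pooled standard error is 1 / sqrt(sum of weights).
    Returns (pooled_r, ci_low, ci_high) on the correlation scale.
    Coefficients must lie strictly between -1 and 1, since atanh is
    undefined at the endpoints.
    """
    if not correlations or len(correlations) != len(sample_sizes):
        raise ValueError("need matching, non-empty inputs")
    weights = [n - 3 for n in sample_sizes]
    if any(w <= 0 for w in weights):
        raise ValueError("each sample size must exceed 3")

    # Transform to the z scale, where the sampling distribution is
    # approximately normal with variance 1 / (n - 3).
    z_values = [math.atanh(r) for r in correlations]
    total_weight = sum(weights)
    z_mean = sum(w * z for w, z in zip(weights, z_values)) / total_weight
    std_err = 1.0 / math.sqrt(total_weight)

    # Build the interval on the z scale, then map back with tanh.
    low = math.tanh(z_mean - z_crit * std_err)
    high = math.tanh(z_mean + z_crit * std_err)
    return math.tanh(z_mean), low, high


if __name__ == "__main__":
    # Hypothetical per-dataset correlations and sample sizes.
    print(aggregate_correlations([-0.20, -0.35, -0.10], [500, 300, 800]))
```

Averaging on the z scale rather than averaging the raw coefficients keeps the interval inside [-1, 1] and gives larger datasets more influence on the pooled value.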