Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks

Jiannan Guan, Qiguang Chen, Libo Qin, Dengyun Peng, Jinhao Liu, Liangyu Huo, Jian Xie, Wanxiang Che

TL;DR

The paper identifies reasoning overconfidence as a key failure mode of LLMs on multi-solution tasks, where models overstate the completeness of their solution sets. It introduces MuSoBench to study completeness across TimeTabling and SubsetSum and compares Short-CoT versus Long-CoT prompting, showing Long-CoT improves recall, calibration, and solution diversity. The authors propose the cognitive-rigidity hypothesis, supported by attention-entropy analyses, as a mechanism driving premature convergence on narrow reasoning paths. They offer mitigation strategies—reflection, exploratory prompts, and parallel self-consistency—that reduce overconfidence and increase coverage, shifting evaluation toward comprehensive search and complete solution enumeration.
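To make "overstating the completeness of a solution set" concrete, the following is a minimal sketch on a toy SubsetSum instance. The instance, the model answer, and the stated confidence are all hypothetical, and the brute-force enumerator and recall helper are illustrative names rather than MuSoBench's actual tooling.

```python
from itertools import combinations

def all_subset_sums(numbers, target):
    """Brute-force every non-empty subset of `numbers` that sums to `target`."""
    solutions = set()
    for size in range(1, len(numbers) + 1):
        for combo in combinations(numbers, size):
            if sum(combo) == target:
                solutions.add(tuple(sorted(combo)))
    return solutions

def recall(predicted, gold):
    """Fraction of the gold solution set that the model actually listed."""
    if not gold:
        return 1.0
    normalized = {tuple(sorted(p)) for p in predicted}
    return len(normalized & gold) / len(gold)

# Hypothetical toy instance (not taken from the benchmark).
numbers = [1, 2, 3, 4, 5]
target = 6
gold = all_subset_sums(numbers, target)

# Hypothetical model output: two of the solutions, reported with high confidence.
model_answer = [(1, 5), (2, 4)]
stated_confidence = 0.95

r = recall(model_answer, gold)
overconfidence_gap = stated_confidence - r
print(f"gold={sorted(gold)} recall={r:.2f} gap={overconfidence_gap:.2f}")
```

A positive gap means the model claims more completeness than its solution set supports, which is the pattern the summary attributes to Short-CoT.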

Abstract

Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce MuSoBench, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
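The abstract does not spell out how attention entropy is computed, so the sketch below assumes the standard Shannon entropy of a single attention distribution. The function name and the two example weight vectors are hypothetical. Under this reading, low entropy means attention concentrated on a few tokens, which is one plausible signature of a narrow set of thought paths.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores over the context tokens;
    they are normalized here so the function also accepts unnormalized rows.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical attention rows over four context tokens.
spread_out = [0.25, 0.25, 0.25, 0.25]   # attention shared evenly
concentrated = [0.97, 0.01, 0.01, 0.01]  # attention locked onto one token

print(attention_entropy(spread_out))
print(attention_entropy(concentrated))
```

The evenly spread row reaches the maximum entropy for four tokens, log 4, while the concentrated row scores far lower.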

Paper Structure

This paper contains 33 sections, 8 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: On multi-solution tasks, the model suffers from reasoning overconfidence, exhibiting excessively high confidence while exploring only a few reasoning paths. This leads to a poor completeness score for the final task.
  • Figure 2: Distribution plots of recall vs. confidence on the TimeTabling dataset. The plots clearly show Short-CoT results clustering in the low-recall, high-confidence corner (red). For SubsetSum results, see the corresponding figure in the Appendix.
  • Figure 3: Calibration and performance of Short-CoT vs. Long-CoT on the TimeTabling dataset. As shown in (a), the diagonal line represents perfect calibration. Long-CoT models (blue) are better calibrated than Short-CoT models (orange). As shown in (b), Long-CoT models achieve significantly higher recall than Short-CoT models. For SubsetSum results, see the corresponding figure in the Appendix.
  • Figure 4: The arrows indicate the movement of model confidence and performance from Short-CoT to Long-CoT. The results show that adopting Long-CoT causes most data points to shift toward the diagonal, indicating improved calibration (red). Results for SubsetSum are shown in the corresponding figure in the Appendix.
  • Figure 5: Factors that influence reasoning overconfidence. (a) A strong negative correlation shows that Long-CoT has moderate confidence. (b) As task complexity rises, Short-CoT keeps unjustifiably high confidence despite falling recall, indicating poor self-monitoring, whereas Long-CoT lowers its confidence in line with the harder setting, demonstrating better calibration. (c) Decoding temperature has little effect on recall or expected calibration error. For more results, see the corresponding SubsetSum figure in the Appendix.
  • ...and 9 more figures
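The captions above refer to calibration against a diagonal and to expected calibration error. As a rough illustration, here is a minimal sketch of the standard binned ECE, assuming per-instance recall plays the role of the observed outcome. The paper's exact binning and outcome definition are not given in this summary, and all numbers below are hypothetical.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Binned ECE: size-weighted mean of |avg confidence - avg outcome| per bin.

    `confidences` and `outcomes` are equal-length lists of values in [0, 1].
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, out in zip(confidences, outcomes):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, out))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_out = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - avg_out)
    return ece

# Hypothetical per-instance recalls, scored under two confidence profiles.
recalls = [0.3, 0.4, 0.2, 0.5]
overconfident = [0.95, 0.90, 0.92, 0.97]  # high confidence, low recall
matched = [0.3, 0.4, 0.2, 0.5]            # confidence tracks recall

ece_over = expected_calibration_error(overconfident, recalls)
ece_matched = expected_calibration_error(matched, recalls)
print(f"overconfident ECE={ece_over:.3f}, matched ECE={ece_matched:.3f}")
```

Points on the diagonal of a reliability diagram contribute nothing to this sum, so a shift toward the diagonal lowers ECE.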