Confidence in the Reasoning of Large Language Models

Yudi Pawitan, Chris Holmes

TL;DR

This study examines how confident large language models are in their reasoning and how that confidence relates to correctness. Comparing GPT4o, GPT4-turbo, and Mistral on two BIG-Bench Hard (BBH) tasks (causal judgement and formal fallacies) and on a set of statistical puzzles, the authors measure both qualitative confidence (persistence in keeping an answer when asked to reconsider) and self-reported confidence under varied prompting, including Self-Discover prompting. They find that although the LLMs outperform random guessing, their confidence signals are often misaligned with correctness: high initial accuracy does not guarantee robust self-correction upon reconsideration, and self-reported confidence tends to be overstated. Prompt design markedly influences whether a model changes its answer, and token-level probabilities only partially explain confidence, suggesting that current LLMs lack an internally coherent sense of confidence. These findings underscore the need for careful prompting, human oversight, and supplementary uncertainty estimation when deploying LLMs as decision-support tools in high-stakes contexts.

Abstract

There is a growing literature on reasoning by large language models (LLMs), but the discussion on the uncertainty in their responses is still lacking. Our aim is to assess the extent of confidence that LLMs have in their answers and how it correlates with accuracy. Confidence is measured (i) qualitatively in terms of persistence in keeping their answer when prompted to reconsider, and (ii) quantitatively in terms of self-reported confidence score. We investigate the performance of three LLMs -- GPT4o, GPT4-turbo and Mistral -- on two benchmark sets of questions on causal judgement and formal fallacies and a set of probability and statistical puzzles and paradoxes. Although the LLMs show significantly better performance than random guessing, there is a wide variability in their tendency to change their initial answers. There is a positive correlation between qualitative confidence and accuracy, but the overall accuracy for the second answer is often worse than for the first answer. There is a strong tendency to overstate the self-reported confidence score. Confidence is only partially explained by the underlying token-level probability. The material effects of prompting on qualitative confidence and the strong tendency for overconfidence indicate that current LLMs do not have any internally coherent sense of confidence.

Paper Structure

This paper contains 24 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of the LLMs on the causal judgement and formal fallacies questions and on the statistical puzzles. 'First answer' is based on a direct zero-shot prompt, followed by the Simple prompt to think again carefully ('Rethink'). Random guesses have an expected accuracy of 0.5 (dotted line), with standard deviations of 0.037 and 0.032 for the causal judgement and the formal fallacies tasks, respectively; the corresponding values for the statistical puzzles are 0.39 (dotted line) and 0.07. P-values for the comparisons of accuracies and proportions are given in Tables \ref{table:bench} and \ref{table:puzzles} in the Appendix.
  • Figure 2: Comparison of the LLMs on the tendency to change their initial answers in the causal judgement task (n = 187 questions) after Simple, Neutral, and Post-confidence rethink prompts.
  • Figure 3: Accuracy and proportion of keeping the first answer as a function of token probability. The latter is based on the Simple (black), Neutral (red) and Post-confidence (blue) rethink prompts. The scattered points are the raw values based on pre-binned/local proportions. The dashed red lines in the first column are lines of identity, which are curved because of the -log-log probability scale.
  • Figure 1A: Distribution of token probabilities for GPT4o and GPT4-turbo for the yes-no and valid-invalid answers in the causal judgement and formal fallacies tasks. Note that the axis uses a -log-log scale in order to spread out the values crowded near one. The median token probabilities are $> 0.995$, except for GPT4o in the formal fallacies task (0.93).
  • Figure 2A: Accuracy and the proportion of changing answers as a function of temperature for GPT4o and GPT4-turbo in the causal judgement (CJ, red lines) and formal fallacies (FF, blue lines) tasks. The bottom figures show the accuracy difference and the proportion of changing answers in independent runs (sessions). The latter is to be contrasted with the top-right figure, which is based on answers after a rethink prompt in the same session. In the bottom-left plot, the red lines for CJ-GPT4o and CJ-GPT4t coincide. Overall, the temperature effect on average accuracy appears to be small, especially up to temperature 1, and not directionally consistent. A similar result is seen for the tendency to change answers after rethinking, except for GPT4o in the formal fallacies task, where the proportion of changing answers goes from 0.17 to 0.34 as the temperature goes from 0 to 1.5. A more consistent effect is seen on the proportion of changing answers in independent runs (i.e. not based on rethinking), where higher temperatures generally lead to a higher proportion of changing answers.
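
Two quantities in the captions above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code. It assumes that the random-guess standard deviations in Figure 1 are binomial standard errors, sqrt(p(1-p)/n), and that the -log-log scale in Figures 3 and 1A uses the natural logarithm. The causal judgement sample size (n = 187) comes from the Figure 2 caption; the formal fallacies sample size (n = 250) is an assumption, since the captions do not state it. The function names are hypothetical.

```python
import math


def random_guess_sd(p: float, n: int) -> float:
    """Standard deviation of the accuracy over n independent guesses,
    each correct with probability p (binomial standard error)."""
    return math.sqrt(p * (1.0 - p) / n)


def neg_log_log(prob: float) -> float:
    """-log(-log(prob)) transform (natural log assumed).
    It stretches out probabilities that crowd near one."""
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    return -math.log(-math.log(prob))


N_CAUSAL = 187     # stated in the Figure 2 caption
N_FALLACIES = 250  # assumption: not stated in the captions

print(round(random_guess_sd(0.5, N_CAUSAL), 3))     # 0.037
print(round(random_guess_sd(0.5, N_FALLACIES), 3))  # 0.032

# Median token probabilities quoted in Figure 1A, on the transformed scale.
for prob in (0.93, 0.995):
    print(prob, neg_log_log(prob))
```

Under these assumptions the computed values agree with the 0.037 and 0.032 quoted in Figure 1. The transform is monotonically increasing, so the ordering of probabilities is preserved while the gap between values such as 0.93 and 0.995 is widened.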