LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval

Elliot L. Epstein, John Winnicki, Thanawat Sornwanee, Rajat Dwaraknath

TL;DR

This work tackles the reliability of uncertainty quantification in large language models by introducing FermiEval, a benchmark of Fermi-style estimation tasks designed to probe confidence interval calibration. It demonstrates pervasive overconfidence: nominal $99\%$ intervals often miss the ground truth, motivating post-hoc adjustment via conformal prediction and direct log-probability elicitation. The proposed conformal calibration guarantees finite-sample coverage and substantially improves Winkler scores, while the log-probability and temperature methods offer practical, complementary gains. The authors also propose a perception-tunnel theory explaining why LLMs truncate their inferred distributions, and they provide a formal framework for tail-consistent interval estimation. Together, these contributions advance reliable uncertainty quantification for mathematical reasoning in LLMs, with broad implications for decision-making and AI safety.

Abstract

Large language models (LLMs) excel at numerical estimation but struggle to correctly quantify uncertainty. We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident. To evaluate this behavior, we introduce FermiEval, a benchmark of Fermi-style estimation questions with a rigorous scoring rule for confidence interval coverage and sharpness. Across several modern models, nominal 99% intervals cover the true answer only 65% of the time on average. With a conformal-prediction-based approach that adjusts the intervals, we obtain accurate 99% observed coverage, and the Winkler interval score decreases by 54%. We also propose direct log-probability elicitation and quantile adjustment methods, which further reduce overconfidence at high confidence levels. Finally, we develop a perception-tunnel theory explaining why LLMs exhibit overconfidence: when reasoning under uncertainty, they act as if sampling from a truncated region of their inferred distribution, neglecting its tails.
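
The two quantities the abstract reports, observed coverage and the Winkler interval score, can be illustrated with a minimal sketch. It uses the standard textbook definition of the Winkler score for a central $(1-\alpha)$ interval: the interval width plus a penalty of $2/\alpha$ times the distance by which the truth falls outside the interval. The function names `winkler_score` and `empirical_coverage` are hypothetical, and the scale on which FermiEval scores answers (for example raw values versus logarithms) is not stated here, so the inputs are treated as plain numbers.

```python
def winkler_score(lower, upper, y, alpha):
    """Winkler interval score for a nominal (1 - alpha) interval; lower is better.

    Standard definition: interval width, plus (2 / alpha) times the distance
    by which the true value y misses the interval, if it misses at all.
    """
    width = upper - lower
    if y < lower:
        return width + (2.0 / alpha) * (lower - y)
    if y > upper:
        return width + (2.0 / alpha) * (y - upper)
    return width


def empirical_coverage(intervals, truths):
    """Fraction of true values that fall inside their (lower, upper) interval."""
    hits = sum(1 for (lo, hi), y in zip(intervals, truths) if lo <= y <= hi)
    return hits / len(truths)


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    intervals = [(1.0, 3.0), (0.0, 1.0)]
    truths = [2.0, 5.0]
    print(empirical_coverage(intervals, truths))
    print(winkler_score(0.0, 1.0, 5.0, alpha=0.01))
```

Because the miss penalty scales with $2/\alpha$, a narrow 99% interval that misses the truth is penalized far more heavily than a wide interval that contains it, which is why overconfident intervals score poorly.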

Paper Structure

This paper contains 38 sections, 4 theorems, 31 equations, 6 figures, 2 tables.

Key Result

Theorem 1

(Consistent Estimator for Tails) Consider a continuous distribution $F \in \Delta((-\infty, \infty))$ whose support is a closed interval. Assume that the perception-tunnel theory holds, meaning that, in each query $i \in \mathbb{N}$, the perceived distribution is $F_{I_i, I_i+\beta}$, where $I_i$ ... and we will have that ...

Figures (6)

  • Figure 1: Calibration curves for representative models. The dashed line indicates perfect calibration ($y=x$). Observed coverage is consistently below nominal coverage, revealing systematic overconfidence in the intervals produced by current LLMs.
  • Figure 2: Base vs. Conformal vs. Logprob scores on test data (lower is better). Conformal calibration consistently improves over the base across models and $p$. The logprob heuristic helps at stricter targets ($p\ge 0.95$) but can underperform at $p=0.90$.
  • Figure 3: LLM Vision Tunnel: The pdf of the inferred distribution is shown as the bold curve. In each query, however, the LLM perceives only a section of the distribution. For example, the LLM may perceive only the orange distribution in the first query and only the green distribution in the second. The answers for the lower bound and the upper bound then differ between queries.
  • Figure 4: The distribution of the lower bound $L$, a random variable whose randomness stems from the LLM's random vision tunnel, is shown in red; that of the random upper bound $U$ is shown in blue. The two distributions overlap. The original distribution $F$ is displayed by its pdf as a bold curve. Let $\hat{L}$ denote the $\frac{\alpha}{2}$ quantile of the distribution of $L$ and $\hat{U}$ the $1-\frac{\alpha}{2}$ quantile of the distribution of $U$; then $\hat{L}$ and $\hat{U}$ are also the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the original distribution $F$. Note that this result is independent of the perception size $\beta$.
  • Figure 5: Calibration curves for open-source models. The dashed line indicates perfect calibration ($y=x$). Observed coverage is consistently below nominal coverage, revealing systematic overconfidence in the intervals produced by current LLMs.
  • ...and 1 more figure
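
The conformal adjustment compared against the base intervals in Figure 2 can be illustrated with a generic split-conformal sketch. This is not necessarily the authors' exact procedure: it assumes the widely used recipe in which each calibration question receives the score $\max(\ell - y, y - u)$ (how far the truth lies outside the stated interval, negative when inside), and every test interval is then widened by the $\lceil (n+1)(1-\alpha) \rceil$-th smallest of the $n$ calibration scores. The names `conformal_margin` and `adjust_interval` are hypothetical.

```python
import math


def conformal_margin(cal_intervals, cal_truths, alpha):
    """Split-conformal margin from a held-out calibration set.

    Score per question: max(lower - y, y - upper), i.e. the signed distance
    by which the truth y falls outside the model's stated interval.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested level.
    """
    scores = sorted(
        max(lo - y, y - hi) for (lo, hi), y in zip(cal_intervals, cal_truths)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return scores[k - 1]


def adjust_interval(interval, margin):
    """Widen (or shrink, if margin is negative) an interval symmetrically."""
    lo, hi = interval
    return lo - margin, hi + margin


if __name__ == "__main__":
    # Illustrative calibration data only, not results from the paper.
    cal_intervals = [(0.0, 1.0)] * 9
    cal_truths = [0.5, 1.5, 2.0, 0.25, -1.0, 3.0, 0.75, 1.25, 4.0]
    margin = conformal_margin(cal_intervals, cal_truths, alpha=0.5)
    print(adjust_interval((0.0, 1.0), margin))
```

Under the usual exchangeability assumption between calibration and test questions, intervals widened this way cover the truth with probability at least $1-\alpha$. With very small $\alpha$ the calibration set must be large: for a 99% target, fewer than 99 calibration questions yields an infinite margin in this sketch.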

Theorems & Definitions (8)

  • Theorem 1
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Theorem 2: Self-Consistency Reduces Tunnel Vision
  • proof