The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity

Tim Tomov, Dominik Fuchsgruber, Tom Wollschläger, Stephan Günnemann

TL;DR

This work shows that uncertainty quantification (UQ) methods for large language models (LLMs) fail under realistic aleatoric uncertainty due to inherent ambiguity in language. By introducing MAQA* and AmbigQA* with ground-truth distributions $p^*$ estimated from corpus co-occurrence, the authors provide a principled framework to evaluate epistemic uncertainty (EU) via $\mathrm{KL}(p^* \,\|\, p)$, revealing that predictive-variation, internal-representation, and ensemble-based estimators perform near random when ambiguity is present. The authors supply theoretical explanations for the zero-aleatoric-uncertainty regime where these estimators can be justified, and show that once non-zero AU is introduced, those justifications collapse because $p^*$ can reside anywhere in the probability simplex, breaking standard signals. They advocate training-time uncertainty modeling (e.g., evidential deep learning, joint distributions) and establish MAQA*/AmbigQA* as benchmarks to drive development of reliable estimators suitable for real-world, ambiguous language tasks.

Abstract

Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.
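
To make the evaluation target concrete, the following is a minimal sketch of how true epistemic uncertainty could be computed as $\mathrm{KL}(p^* \,\|\, p)$ once a ground-truth answer distribution is available. It assumes that $p^*$ is obtained by simply normalizing per-answer co-occurrence counts; the paper's actual estimation procedure is not described in this summary. The function names `ground_truth_from_counts` and `epistemic_uncertainty` and the example counts are hypothetical.

```python
import numpy as np


def ground_truth_from_counts(counts):
    """Turn non-negative per-answer counts into a distribution p*.

    Assumption: plain normalization of counts. The paper's own
    estimator of p* may differ.
    """
    counts = np.asarray(counts, dtype=float)
    if np.any(counts < 0) or counts.sum() <= 0:
        raise ValueError("counts must be non-negative with a positive sum")
    return counts / counts.sum()


def epistemic_uncertainty(p_star, p, eps=1e-12):
    """KL(p* || p) in nats, treating 0 * log 0 as 0.

    Model probabilities are clipped below at eps so that the logarithm
    stays finite when the model assigns zero mass to a valid answer.
    """
    p_star = np.asarray(p_star, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    support = p_star > 0
    return float(np.sum(p_star[support] * (np.log(p_star[support]) - np.log(p[support]))))


# Hypothetical ambiguous question with three valid answers.
p_star = ground_truth_from_counts([60, 30, 10])
matching_model = [0.6, 0.3, 0.1]          # reproduces the ambiguity
overconfident_model = [0.98, 0.01, 0.01]  # collapses onto one answer

print(epistemic_uncertainty(p_star, matching_model))
print(epistemic_uncertainty(p_star, overconfident_model))
```

In this toy setup, the model that reproduces the ambiguity has high predictive entropy but zero EU, while the low-entropy model has large EU, which is the kind of mismatch the abstract describes.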

Paper Structure

This paper contains 58 sections, 9 theorems, 29 equations, 7 figures, 10 tables.

Key Result

Theorem 1

Let there be $K\ge 2$ classes and let $\delta\in[0,\log K]$ be a threshold on the entropy indicating uncertainty. Furthermore, let $\alpha_\delta$ be the maximal probability that any single class can receive such that $H(p)\geq\delta$. Then the epistemic uncertainty of a prediction with $H(p)\ge\delta$ is at least:
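
The quantity $\alpha_\delta$ can be evaluated numerically. The sketch below rests on two assumptions that are not stated in the theorem text above: first, that the theorem concerns the zero-aleatoric-uncertainty setting (as Figure 1 and Proposition 3 suggest), where $p^*$ is one-hot on a class $y^*$ and therefore $\mathrm{KL}(p^* \,\|\, p) = -\log p(y^*)$; second, that the resulting bound is read as $-\log \alpha_\delta$, which follows from $p(y^*) \le \alpha_\delta$. It is an illustration of that reading, not a restatement of the paper's formula. The helper names are hypothetical.

```python
import math


def max_entropy_given_peak(alpha, K):
    """Largest entropy (nats) of a K-class distribution that puts
    probability alpha on one class: the remaining mass is spread
    evenly over the other K - 1 classes."""
    rest = 1.0 - alpha
    h = 0.0
    if alpha > 0.0:
        h -= alpha * math.log(alpha)
    if rest > 0.0:
        h -= rest * math.log(rest / (K - 1))
    return h


def alpha_delta(delta, K, tol=1e-12):
    """Largest single-class probability compatible with H(p) >= delta.

    max_entropy_given_peak is decreasing in alpha on [1/K, 1], so the
    answer can be found by bisection on that interval.
    """
    if K < 2 or not (0.0 <= delta <= math.log(K) + 1e-12):
        raise ValueError("need K >= 2 and 0 <= delta <= log K")
    lo, hi = 1.0 / K, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if max_entropy_given_peak(mid, K) >= delta:
            lo = mid  # mid is still feasible, move right
        else:
            hi = mid
    return lo


def eu_lower_bound_zero_au(delta, K):
    """-log(alpha_delta): lower bound on EU under the zero-AU assumption."""
    return -math.log(alpha_delta(delta, K))


for delta in (0.2, 0.6, 1.0):
    print(delta, eu_lower_bound_zero_au(delta, K=3))
```

Under these assumptions the bound grows with the entropy threshold $\delta$: the more entropy a prediction is required to have, the less mass it can place on the single correct class.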

Figures (7)

  • Figure 1: Theoretical insights on the 3-class simplex. Left: Under zero aleatoric uncertainty, high entropy guarantees high EU, since all possible $p^*$ are far away (Theorem 1). Assuming a well-trained model, observing a low-entropy distribution likely indicates low EU, as the model cannot frequently be confidently incorrect (Theorem 2). Right: Under non-trivial aleatoric uncertainty, observing high or low entropy does not provide information about the EU, since the ground-truth distribution $p^*$ is not constrained to any particular location in the probability simplex.
  • Figure 2: Left: Distribution of ground-truth entropy $H(p^*)$ across questions in MAQA$^*$ and AmbigQA$^*$. Right: Distribution of JS divergences between different proxies for estimating $p^*$. The low divergence validates the quality of these distributions.
  • Figure 3: Relationship between prediction-based estimators and true epistemic uncertainty (EU) for Gemma 3-12B on MAQA$^*$. Left: Relationship between $H(p)$ and true $\mathrm{EU}$. If aleatoric uncertainty (AU) is zero, predictive entropy and prediction-based EU correlate. This correlation vanishes under non-trivial AU. Lines indicate theoretical bounds on EU. Right: The average ROC curve of prediction-based estimators for identifying predictions with high true EU ($\mathrm{EU} > \log(2)$) approaches random performance. Shaded regions represent one standard deviation over different estimators.
  • Figure 4: MLP regression performance across layers. Under zero AU, probes achieve satisfactory ranking capability in deeper layers. Under non-trivial AU, performance collapses significantly, showing that hidden states do not reliably encode EU when ambiguity is present.
  • Figure 5: Entropy collapse of Instruct models on MAQA$^*$ and AmbigQA$^*$
  • ...and 2 more figures
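
The ROC-based evaluation described for Figure 3 can be sketched as follows: an estimator's score (here, predictive entropy) is used to rank predictions, and the ranking is compared against a binary label marking high true EU. The sketch assumes that "high" means true EU above the threshold $\log 2$ and computes AUROC directly from pairwise comparisons. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))


def kl(p_star, p, eps=1e-12):
    """KL(p* || p) in nats; p is clipped below at eps."""
    p_star = np.asarray(p_star, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    s = p_star > 0
    return float(np.sum(p_star[s] * (np.log(p_star[s]) - np.log(p[s]))))


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative,
    counting ties as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def entropy_auroc(p_stars, preds, threshold=np.log(2)):
    """AUROC of predictive entropy for detecting true EU > threshold.

    Assumption: the positive class is EU above the threshold.
    """
    true_eu = np.array([kl(ps, p) for ps, p in zip(p_stars, preds)])
    scores = np.array([entropy(p) for p in preds])
    return auroc(scores, true_eu > threshold)


# Toy zero-AU data: every p* is one-hot on the first class.
p_stars = [[1.0, 0.0]] * 4
preds = [[0.99, 0.01], [0.3, 0.7], [0.2, 0.8], [0.9, 0.1]]
print(entropy_auroc(p_stars, preds))
```

An AUROC of 0.5 corresponds to the random performance mentioned in the caption; the toy data only exercises the metric and says nothing about the paper's results.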

Theorems & Definitions (14)

  • Theorem 1: High Entropy $\Rightarrow$ High EU
  • Theorem 2: Low Entropy $\Rightarrow$ Low EU with High Probability
  • Proposition 1: Non-Identifiability of Epistemic Uncertainty
  • Proposition 2: High MI $\not\Rightarrow$ High EU
  • Proposition 1: Non-Identifiability of Epistemic Uncertainty
  • proof
  • Proposition 2: High MI $\not\Rightarrow$ High EU
  • proof
  • Proposition 3: Zero aleatoric uncertainty implies EU is NLL
  • proof
  • ...and 4 more
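
The claim named in Proposition 2 (high MI does not imply high EU) can be illustrated with a small numeric construction. This is one simple example in the spirit of the statement, not necessarily the construction used in the paper's proof. It assumes that the ensemble's EU is measured as $\mathrm{KL}(p^* \,\|\, \bar{p})$ for the ensemble mean $\bar{p}$, and that MI is the usual entropy of the mean minus the mean member entropy. The function names are hypothetical.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))


def ensemble_mutual_information(members):
    """Entropy of the mean prediction minus the mean member entropy."""
    members = np.asarray(members, dtype=float)
    mean_pred = members.mean(axis=0)
    return entropy(mean_pred) - float(np.mean([entropy(m) for m in members]))


def kl(p_star, p, eps=1e-12):
    """KL(p* || p) in nats; p is clipped below at eps."""
    p_star = np.asarray(p_star, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    s = p_star > 0
    return float(np.sum(p_star[s] * (np.log(p_star[s]) - np.log(p[s]))))


# Ambiguous question: two answers are equally valid.
p_star = np.array([0.5, 0.5])

# Two ensemble members that strongly disagree with each other.
members = np.array([[0.9, 0.1],
                    [0.1, 0.9]])

mi = ensemble_mutual_information(members)
eu_of_mean = kl(p_star, members.mean(axis=0))

print("MI:", mi)
print("EU of ensemble mean:", eu_of_mean)
```

The members disagree, so MI is clearly positive, yet their average coincides with $p^*$ and the EU of the averaged prediction is zero. Disagreement among members therefore cannot be read as evidence of high EU once $p^*$ itself may be spread over several answers.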