Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios

Autumn Toney-Wails, Ryan Wails

TL;DR

This study tackles uncertainty quantification in LLMs for probabilistic tasks by examining whether token-level certainty (from $\text{Pr}(t)$ and $H(T)$) aligns with theoretical probability distributions. Using GPT-4.1 and DeepSeek on ten well-defined probabilistic prompts, it distinguishes output validity (conformance to constraints) from distributional alignment (match to theoretical $\text{Pr}$ and $H$). Findings show perfect validity across samples but persistent misalignment in token probabilities and entropies, even under explicit sampling instructions; GPT-4.1 often calibrates better than DeepSeek but both fall short in entropy alignment. The work argues for extending UQ to separately or jointly measure validity and distribution alignment, informing safe deployment in probability-sensitive applications and urging calibration or post-processing when probabilistic accuracy is critical.

Abstract

Reliable uncertainty quantification (UQ) is essential for ensuring trustworthy downstream use of large language models, especially when they are deployed in decision-support and other knowledge-intensive applications. Model certainty can be estimated from token logits, with derived probability and entropy values offering insight into performance on the prompt task. However, this approach may be inadequate for probabilistic scenarios, where the probabilities of token outputs are expected to align with the theoretical probabilities of the possible outcomes. We investigate the relationship between token certainty and alignment with theoretical probability distributions in well-defined probabilistic scenarios. Using GPT-4.1 and DeepSeek-Chat, we evaluate model responses to ten prompts involving probability (e.g., roll a six-sided die), both with and without explicit probability cues in the prompt (e.g., roll a fair six-sided die). We measure two dimensions: (1) response validity with respect to scenario constraints, and (2) alignment between token-level output probabilities and theoretical probabilities. Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions.
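
The two dimensions described above can be illustrated with a minimal Python sketch for the six-sided die scenario. It is not the authors' evaluation code. The log-probability values are invented for illustration, the helper names (`normalize`, `entropy_bits`, `total_variation`) are hypothetical, and total variation distance is used here only as one plausible alignment measure; the abstract does not say which metric the paper applies.

```python
import math

def normalize(logprobs):
    """Turn {token: logprob} into probabilities renormalized over the listed tokens."""
    probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def entropy_bits(probs):
    """Shannon entropy H(T) in bits of a {token: probability} mapping."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def total_variation(p, q):
    """Total variation distance between two distributions over tokens."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)

# ASSUMPTION: made-up top-token log-probabilities for "roll a six-sided die".
# These numbers are illustrative only and are not results from the paper.
model_logprobs = {
    "4": math.log(0.70),
    "3": math.log(0.12),
    "5": math.log(0.08),
    "2": math.log(0.05),
    "6": math.log(0.03),
    "1": math.log(0.02),
}

# Theoretical distribution for a fair die: each face has probability 1/6.
theoretical = {str(face): 1 / 6 for face in range(1, 7)}

model_probs = normalize(model_logprobs)

# Dimension 1: validity. Every candidate token is a legal die face.
valid = all(tok in theoretical for tok in model_probs)

# Dimension 2: alignment. Compare entropy and per-token probabilities.
h_model = entropy_bits(model_probs)
h_theory = entropy_bits(theoretical)  # equals log2(6) for a fair die
tv = total_variation(model_probs, theoretical)

print(f"valid={valid}  H_model={h_model:.3f} bits  H_theory={h_theory:.3f} bits  TV={tv:.3f}")
```

In this made-up case every candidate token is a valid face, yet the model's entropy is well below the theoretical $\log_2 6$ bits: the output is valid and confident but not distributionally aligned, which is the gap the abstract describes.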

Paper Structure

This paper contains 15 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Probabilistic scenario prompting and response evaluation design.
  • Figure 2: Comparisons of LLM token distributions to the theoretical distributions.
  • Figure 3: DeepSeek-Chat example dialogue for probabilistic reasoning about prompt scenarios.
  • Figure 4: GPT-4.1 example dialogue for probabilistic reasoning about prompt scenarios.