Measuring Language Model Hallucinations Through Distributional Correctness

Thomas F Burns

TL;DR

This work argues that single-answer accuracy fails to capture a model's belief distribution and its willingness to abstain. It introduces the Distributional Correctness Score (DCS), a distribution-level utility that rewards correct beliefs and hedging toward abstention while penalizing harmful overconfidence, with a default range of $[-1,1]$ and abstention as a neutral anchor. The author establishes theoretical properties, including score bounds, an incentive ordering, and an information-theoretic bound, and evaluates DCS empirically on 12 benchmarks with six models, finding widespread epistemic overconfidence: on half of the benchmarks, DCS is negative across all tested models. DCS reveals nuanced behavior that traditional accuracy-based metrics miss, particularly in safety and fairness domains, and its loadings can be adjusted to reflect different risk profiles in deployment. Overall, DCS provides a principled, abstention-aware framework for evaluating model uncertainty and reducing hallucinations in practical AI systems.

Abstract

Common evaluation paradigms for language models score single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work illustrates that language models hallucinate in part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, these ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers and hedging toward "I don't know" responses. A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to address this problem by considering a model's entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing. Adapting 12 existing evaluation benchmarks to DCS's variants and measuring performance on six language models reveals that, for half of the tested benchmarks, scores are negative across all tested models, indicating significant tendencies towards hallucination.

Paper Structure

This paper contains 34 sections, 8 theorems, 17 equations, 2 figures, 6 tables.

Key Result

Theorem 1

DCS, as per Definition 1, under the assumption of default loadings ($l_c = l_w = 1$), is bounded in the range $[-1,1]$ for any valid probability distribution $\pi = (p_c, P_W, p_{\textsc{idk}})$. Moreover, for the three canonical distributions $\pi_{CI}$, $\pi_{HA}$, and $\pi_{CC}$, $\text{DCS}(\pi_{CI}) < \text{DCS}(\pi_{HA}) < \text{DCS}(\pi_{CC})$.
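
The bounds and ordering in Theorem 1 can be checked numerically on a toy utility. The sketch below is illustrative only: Definition 1 is not reproduced in this summary, so `standin_score` is a hypothetical linear stand-in (correct mass rewarded by $l_c$, wrong mass penalised by $l_w$, abstention mass contributing zero) and not the paper's DCS formula. The three distributions are likewise assumptions, reading the subscripts as confidently incorrect (CI), full abstention (HA), and confidently correct (CC).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    """A distribution pi = (p_c, P_W, p_idk) over correct / wrong / abstain."""
    p_c: float    # probability mass on the correct answer
    p_w: float    # total probability mass on wrong answers (P_W)
    p_idk: float  # probability mass on abstaining ("I don't know")

    def __post_init__(self):
        probs = (self.p_c, self.p_w, self.p_idk)
        if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("not a valid probability distribution")


def standin_score(b: Belief, l_c: float = 1.0, l_w: float = 1.0) -> float:
    """Hypothetical stand-in utility, NOT the paper's Definition 1.

    Abstention mass is treated as neutral (contributes 0), matching the
    'abstention as a neutral anchor' description in the summary.
    """
    return l_c * b.p_c - l_w * b.p_w


# Assumed canonical distributions (names inferred from the subscripts).
pi_ci = Belief(p_c=0.0, p_w=1.0, p_idk=0.0)  # all mass on wrong answers
pi_ha = Belief(p_c=0.0, p_w=0.0, p_idk=1.0)  # all mass on abstention
pi_cc = Belief(p_c=1.0, p_w=0.0, p_idk=0.0)  # all mass on the correct answer

scores = [standin_score(pi) for pi in (pi_ci, pi_ha, pi_cc)]
```

Under this stand-in with default loadings, every valid distribution scores in $[-1,1]$ because $p_c$ and $P_W$ each lie in $[0,1]$ and sum to at most one, and the three canonical cases sit at the bottom, the neutral midpoint, and the top of that range. Any other utility with the same endpoints would pass the same check, so this confirms only that the theorem's statement is coherent, not what Definition 1 is.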

Figures (2)

  • Figure 1: Values of the optimal guessing threshold in DCS, as given by Proposition 1, for different values of the parameters $l_c$ and $l_w$.
  • Figure 2: Mean MMLU scores as measured by accuracy, confidence-weighted accuracy, ternary score, and DCS across the evaluated models. All scores are multiplied by $100$ for readability. Error bars represent the standard error (S.E.).

Theorems & Definitions (15)

  • Definition 1: Distributional Correctness Score
  • Example 1: Error-Hedging vs. Abstention-Hedging.
  • Example 2: Lucky Guesses vs. Confident Knowledge.
  • Theorem 1: Score Bounds & Incentive Ordering
  • Corollary 1: Preference for Abstention-Hedging
  • Proposition 1: Optimal Guessing Threshold
  • Proposition 2: Information-Theoretic Performance Bound
  • Theorem 1: Score Bounds & Incentive Ordering
  • proof
  • Corollary 1: Preference for Abstention-Hedging
  • ...and 5 more