Estimating Semantic Alphabet Size for LLM Uncertainty Quantification

Lucas H. McCabe, Rimon Melamed, Thomas Hartvigsen, H. Howie Huang

TL;DR

The paper tackles the challenge of estimating uncertainty in large language models under black-box constraints, where extensive sampling is costly and internal model signals may be unavailable. It demonstrates that canonical discrete semantic entropy (DSE) underestimates the true semantic entropy ($SE$) at practical sample sizes and proposes a novel, interpretable approach: estimate the semantic alphabet size and adjust SE for sample coverage using a hybrid estimator that blends Good-Turing and spectral methods. The results show the coverage-adjusted estimator substantially reduces bias and improves incorrectness detection compared with other explicit SE estimators, while alphabet-size estimators such as the Hybrid and $U_{EigV}$ often outperform many baselines in ranking methods under uncertainty. This work offers a practical, interpretable path to reliable LLM uncertainty quantification with limited samples, relevant for risk-sensitive deployment and diagnostic tooling.

Abstract

Many black-box techniques for quantifying the uncertainty of large language models (LLMs) rely on repeated LLM sampling, which can be computationally expensive. Therefore, practical applicability demands reliable estimation from few samples. Semantic entropy (SE) is a popular sample-based uncertainty estimator with a discrete formulation attractive for the black-box setting. Recent extensions of SE exhibit improved LLM hallucination detection, but do so with less interpretable methods that admit additional hyperparameters. For this reason, we revisit the canonical discrete semantic entropy (DSE) estimator, finding that it underestimates the "true" semantic entropy, as expected from theory. We propose a modified semantic alphabet size estimator, and illustrate that using it to adjust DSE for sample coverage results in more accurate SE estimation in our setting of interest. Furthermore, we find that two semantic alphabet size estimators, including our proposed estimator, flag incorrect LLM responses as well as or better than many top-performing alternatives, with the added benefit of remaining highly interpretable.
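
To make the underestimation claim concrete, the following is a minimal sketch of a plug-in entropy over semantic class labels next to a coverage-adjusted alternative. It is not the paper's hybrid estimator, whose definition is not reproduced on this page; as a stand-in it assumes the classical Good-Turing coverage estimate and a Chao-Shen style adjustment. The function names and the example labels are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy(labels):
    """Plug-in (maximum-likelihood) entropy of semantic class labels, in nats."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_coverage(labels):
    """Good-Turing sample coverage: 1 - (number of singleton classes) / n."""
    n = len(labels)
    counts = Counter(labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    if singletons == n:
        # Assumption: when every class is a singleton, back off by one
        # so the estimated coverage is not exactly zero.
        singletons = n - 1
    return 1.0 - singletons / n


def coverage_adjusted_entropy(labels):
    """Chao-Shen style entropy (nats), used here as an assumed stand-in."""
    n = len(labels)
    counts = Counter(labels)
    coverage = good_turing_coverage(labels)
    total = 0.0
    for c in counts.values():
        p = coverage * c / n               # coverage-shrunk class probability
        inclusion = 1.0 - (1.0 - p) ** n   # chance the class appears in n draws
        total -= p * math.log(p) / inclusion
    return total


# Hypothetical semantic class assignments for five sampled responses.
labels = ["a", "a", "a", "b", "c"]
print(plugin_entropy(labels), coverage_adjusted_entropy(labels))
```

With two singleton classes among five samples, the estimated coverage is 0.6, and the adjusted entropy comes out larger than the plug-in value. This matches the direction of the correction described in the abstract: the plug-in estimate ignores the probability mass of classes that were never sampled.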

Paper Structure

This paper contains 56 sections, 13 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: High-level schematic of semantic alphabet size estimation for LLM uncertainty quantification (Section \ref{sec:estimators}). (A) Generate LLM responses to a query. (B) Assign responses to categories of shared meaning. (C) Estimate semantic alphabet size, accounting for semantic classes unobserved in the sample (Equation \ref{eq:hybrid-alphabet}). LLM response examples are hypothetical for illustrative purposes.
  • Figure 2: Illustrating underestimation in discrete semantic entropy (DSE) calculation with typical sample sizes. The ratios of DSE estimators with varying sample size ($n = 5, 10, 25, 50, 75, 100$) to white-box SE with $n=100$ (denoted $SE^*$) are shown, with values below $1$ suggesting underestimation (dotted grey line). The estimators displayed are the plugin estimator of canonical DSE (i.e., Equation \ref{eq:plugin_dse}, dotted indigo line) and the "hybrid" DSE estimator of Equation \ref{eq:cs-hybrid} (solid indigo line). Results are averaged across queries within each dataset, then uniformly averaged across datasets. Log scale is used on the x-axis to highlight differences between estimators with smaller sample sizes. Instances with a denominator of $0$ are ignored.
  • Figure 3: Establishing overall performance of ten UQ methods on incorrectness detection. (A) Bradley-Terry latent strength scores from pairwise comparison of AUROC point estimates. (B) Bradley-Terry latent strength scores after accounting for uncertainty in estimating AUROC. Error bars are "conservative" CIs about strength scores, which may be slightly stricter than $95\%$; see Section \ref{sec:bradley_terry} for details. (C) For each method, we establish $95\%$ CIs about the rank of Bradley-Terry latent strength MLEs \cite{gao2023uncertainty} for the incorrectness detection task; see Section \ref{sec:bradley_terry} for details. We highlight (i) semantic alphabet size estimators, (ii) black-box discrete semantic entropy (DSE) estimators, and (iii) other uncertainty estimators, which include white-box (PE, SE) and black-box methods (SNNE, KLE). The interval $[a, b]$ denotes all integers from $a$ to $b$, inclusive. The CI for SNNE is $[6, 6]$ and is drawn wider for readability.
  • Figure 4: Heatmap illustrating the proportion of model-dataset pairs, rounded to the nearest integer, for which a row's method achieved a larger AUROC point estimate than a column's method. Uncertainty methods are organized into three groups: (i) semantic alphabet size estimators, (ii) black-box discrete semantic entropy (DSE) estimators, and (iii) other uncertainty estimators, which include white-box (PE, SE) and black-box methods (SNNE, KLE). The hybrid discrete semantic entropy (DSE) estimator consistently outperforms other explicit SE estimators.
  • Figure 5: (A) Empirical survival function of the number of possible correct semantic categories and the number of observed semantic categories in the responses generated by GPT-4o-mini, according to a human annotator. (B) Scattergram of the number of observed semantic categories ("Observed"), according to the human annotator, against the number of possible correct semantic categories ("Possible"). One question is excluded ("Identify a programming language designed by Microsoft."), because the total number of possible correct semantic categories was not known by the authors of this work.
  • ...and 6 more figures
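
The pairwise comparisons behind Figures 3(A) and 4 can be sketched as follows. This is only an illustration under stated assumptions: it counts, for each pair of methods, how often one achieves a larger AUROC point estimate, then fits Bradley-Terry strengths with the standard minorization-maximization update. It does not reproduce the paper's uncertainty-aware scores or confidence intervals, and the method names and AUROC values are hypothetical.

```python
def pairwise_wins(auroc):
    """Count wins per ordered pair; auroc maps method -> list of AUROC values,
    one per model-dataset pair, in the same order for every method."""
    methods = list(auroc)
    wins = {a: {b: 0 for b in methods if b != a} for a in methods}
    for a in methods:
        for b in methods:
            if a != b:
                wins[a][b] = sum(x > y for x, y in zip(auroc[a], auroc[b]))
    return wins


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths (normalized to sum to 1) by MM iteration."""
    methods = list(wins)
    strength = {m: 1.0 / len(methods) for m in methods}
    for _ in range(iters):
        updated = {}
        for i in methods:
            total_wins = sum(wins[i].values())
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in methods
                if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {m: v / norm for m, v in updated.items()}
    return strength


# Hypothetical AUROC point estimates on four model-dataset pairs.
auroc = {
    "A": [0.80, 0.75, 0.70, 0.82],
    "B": [0.78, 0.77, 0.69, 0.80],
    "C": [0.70, 0.76, 0.71, 0.60],
}
wins = pairwise_wins(auroc)
print(wins)
print(bradley_terry(wins))
```

Dividing each entry of `wins` by the number of model-dataset pairs gives the kind of win proportion that the Figure 4 heatmap reports. The sketch assumes every method wins at least one comparison; otherwise the maximum-likelihood strengths are not well defined.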