Estimating Semantic Alphabet Size for LLM Uncertainty Quantification
Lucas H. McCabe, Rimon Melamed, Thomas Hartvigsen, H. Howie Huang
TL;DR
The paper addresses uncertainty estimation for large language models under black-box constraints, where extensive sampling is costly and internal model signals may be unavailable. It shows that the canonical discrete semantic entropy (DSE) estimator underestimates the true semantic entropy (SE) at practical sample sizes, and proposes an interpretable remedy: estimate the semantic alphabet size with a hybrid estimator that blends Good-Turing and spectral methods, then use that estimate to adjust DSE for sample coverage. The coverage-adjusted estimator yields more accurate SE estimates than the unadjusted DSE and improves incorrectness detection compared with other explicit SE estimators. Used directly as uncertainty scores, alphabet-size estimators such as the Hybrid and $U_{EigV}$ flag incorrect responses as well as or better than many top-performing baselines. This work offers a practical, interpretable path to reliable LLM uncertainty quantification with limited samples, relevant for risk-sensitive deployment and diagnostic tooling.
Abstract
Many black-box techniques for quantifying the uncertainty of large language models (LLMs) rely on repeated LLM sampling, which can be computationally expensive. Therefore, practical applicability demands reliable estimation from few samples. Semantic entropy (SE) is a popular sample-based uncertainty estimator with a discrete formulation attractive for the black-box setting. Recent extensions of SE exhibit improved LLM hallucination detection, but do so with less interpretable methods that admit additional hyperparameters. For this reason, we revisit the canonical discrete semantic entropy (DSE) estimator, finding that it underestimates the "true" semantic entropy, as expected from theory. We propose a modified semantic alphabet size estimator, and illustrate that using it to adjust DSE for sample coverage results in more accurate SE estimation in our setting of interest. Furthermore, we find that two semantic alphabet size estimators, including our proposed estimator, flag incorrect LLM responses as well as or better than many top-performing alternatives, with the added benefit of remaining highly interpretable.
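
To make the idea of adjusting DSE for sample coverage concrete, the following is a minimal sketch and not the paper's hybrid estimator. It assumes the sampled responses have already been grouped into semantic clusters (the cluster labels are hypothetical inputs) and computes the plug-in entropy over those clusters, the Good-Turing coverage estimate, and a Chao-Shen style coverage-adjusted entropy, which is a standard estimator from the entropy-estimation literature substituted here for illustration. All function names are assumptions introduced for this example.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Plug-in entropy (in nats) over semantic cluster labels."""
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("need at least one sampled response")
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_coverage(cluster_ids):
    """Good-Turing coverage estimate: 1 - (number of singleton clusters) / n."""
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("need at least one sampled response")
    counts = Counter(cluster_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    # Common convention: avoid zero coverage when every cluster is a singleton.
    if singletons == n:
        singletons = n - 1
    return 1.0 - singletons / n


def coverage_adjusted_entropy(cluster_ids):
    """Chao-Shen style entropy (illustrative stand-in, not the paper's method)."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    coverage = good_turing_coverage(cluster_ids)
    total = 0.0
    for c in counts.values():
        p = coverage * c / n              # shrink observed mass by coverage
        inclusion = 1.0 - (1.0 - p) ** n  # chance the cluster appears in n draws
        total -= p * math.log(p) / inclusion
    return total


# Hypothetical cluster labels for six sampled responses to one prompt.
labels = ["a", "a", "a", "b", "b", "c"]
plug_in = discrete_semantic_entropy(labels)
adjusted = coverage_adjusted_entropy(labels)
print(f"plug-in: {plug_in:.3f}  coverage-adjusted: {adjusted:.3f}")
```

On this small sample the coverage-adjusted value exceeds the plug-in value, which is the direction of correction one would expect if the plug-in estimator is biased downward at small sample sizes. The paper's own adjustment relies on its hybrid alphabet-size estimator, which this sketch does not attempt to reproduce.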
