A Survey of Uncertainty Estimation Methods on Large Language Models
Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, Hang Liu
TL;DR
This survey addresses uncertainty estimation for large language models (LLMs) at inference time, with the goal of mitigating biased and hallucinated outputs. It introduces a four-class taxonomy of methods (verbalizing, latent information, consistency-based, and semantic clustering), all rooted in the autoregressive token distribution $p_i = \mathrm{Softmax}(f(\mathbf{x}, \mathbf{r}_{<i}))$ and its variance, with uncertainty surfaced through sampling, prompts, and aggregation. Experiments on datasets such as TruthfulQA, SciQ, TriviaQA, GSM8K, and SimpleQA benchmark representative methods using metrics such as AUROC and AUARC; they reveal strengths for latent information and semantic clustering approaches and highlight the practical limitations of verbalizing. The findings offer guidance for practical deployment and underscore the need for dedicated uncertainty benchmarks and for methods that can handle long-form, multi-step reasoning in real-world settings.
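To make the token distribution $p_i$ concrete, the following is a minimal sketch of how a latent-information-style score could be computed from per-step logits. It is an illustration under stated assumptions, not the survey's implementation: the logits are toy values standing in for $f(\mathbf{x}, \mathbf{r}_{<i})$ (assuming $\mathbf{x}$ is the input and $\mathbf{r}_{<i}$ the previously generated tokens), and the function names `softmax`, `token_entropy`, and `sequence_uncertainty` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one step's logits, giving p_i."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of a single token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(step_logits, chosen_ids):
    """Return (mean token entropy, length-normalized negative log-likelihood).

    step_logits: one list of logits per generated token (toy stand-in for
                 the model output f(x, r_<i); an assumption for illustration).
    chosen_ids:  index of the token actually generated at each step.
    """
    entropies = []
    nll = 0.0
    for logits, tok in zip(step_logits, chosen_ids):
        p = softmax(logits)
        entropies.append(token_entropy(p))
        nll -= math.log(p[tok])
    n = len(chosen_ids)
    return sum(entropies) / n, nll / n


if __name__ == "__main__":
    # A confident step followed by a near-uniform (uncertain) step.
    logits = [[8.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.1, 0.0]]
    print(sequence_uncertainty(logits, chosen_ids=[0, 2]))
```

Higher values of either score indicate a flatter distribution, and thus greater uncertainty, under this sketch. In an evaluation like the one the survey describes, such scores would be compared against answer correctness using a metric such as AUROC.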
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, these models can produce biased, hallucinated, or non-factual responses that are camouflaged by their fluency and realistic appearance. Uncertainty estimation is a key method for addressing this challenge. Although research on uncertainty estimation is ramping up, comprehensive and dedicated surveys of LLM uncertainty estimation are still lacking. This survey presents four major avenues of LLM uncertainty estimation. We also perform extensive experimental evaluations across multiple methods and datasets. Finally, we outline critical and promising future directions for LLM uncertainty estimation.
