A Survey of Uncertainty Estimation Methods on Large Language Models

Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, Hang Liu

TL;DR

This survey addresses uncertainty estimation in large language models (LLMs) during inference to mitigate biased and hallucinated outputs. It introduces a four-class taxonomy of methods (verbalizing, latent information, consistency-based, and semantic clustering) rooted in the autoregressive token distribution $p_i$ and its variance, formalized as $p_i = \mathrm{softmax}(f(\mathbf{x}, \mathbf{r}_{<i}))$, with uncertainty reflected through sampling, prompts, and aggregation. Through extensive experiments across datasets such as TruthfulQA, SciQ, TriviaQA, GSM8K, and SimpleQA, the paper benchmarks representative methods using metrics like AUROC and AUARC, revealing strengths for latent information and semantic clustering approaches and highlighting limitations of verbalizing in practice. The findings provide guidance for practical deployment and underscore the need for dedicated uncertainty benchmarks and methods capable of handling long-form, multi-step reasoning in real-world settings.
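To make the formula $p_i = \mathrm{softmax}(f(\mathbf{x}, \mathbf{r}_{<i}))$ concrete, the following is a minimal sketch of how token-level distributions might be turned into an uncertainty score. It is not the survey's implementation. It assumes that $f$ returns one vector of logits per generated token, and it uses two commonly used token-probability scores (mean per-token entropy and mean negative log-likelihood) purely as illustrations. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of one token distribution p_i."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(step_logits, chosen_ids):
    """Aggregate token-level uncertainty over a generated response.

    step_logits: one logit vector per generated token (assumed output of f).
    chosen_ids:  index of the token actually emitted at each step.
    Returns (mean per-token entropy, mean negative log-likelihood).
    """
    entropies, nlls = [], []
    for logits, tok in zip(step_logits, chosen_ids):
        p = softmax(logits)
        entropies.append(token_entropy(p))
        nlls.append(-math.log(p[tok]))
    n = len(chosen_ids)
    return sum(entropies) / n, sum(nlls) / n


# Hypothetical toy logits over a 3-token vocabulary, two generation steps each.
confident_logits = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]]   # peaked distributions
uncertain_logits = [[1.0, 0.9, 1.1], [0.5, 0.6, 0.4]]   # nearly flat distributions

confident_score = sequence_uncertainty(confident_logits, [0, 1])
uncertain_score = sequence_uncertainty(uncertain_logits, [2, 1])
```

Under these assumptions, the peaked distributions yield a lower entropy and a lower negative log-likelihood than the nearly flat ones, which is the signal that scores built on token probabilities rely on.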

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, these models can produce biased, hallucinated, or non-factual responses camouflaged by their fluency and realistic appearance. Uncertainty estimation is the key method to address this challenge. While research efforts in uncertainty estimation are ramping up, there is a lack of comprehensive and dedicated surveys on LLM uncertainty estimation. This survey presents four major avenues of LLM uncertainty estimation. Furthermore, we perform extensive experimental evaluations across multiple methods and datasets. Finally, we provide critical and promising future directions for LLM uncertainty estimation.

Paper Structure

This paper contains 25 sections, 18 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Illustration of uncertainty estimation.
  • Figure 2: Illustration of uncertainty versus confidence.
  • Figure 3: Taxonomy of uncertainty estimation methods on LLMs.
  • Figure 4: Illustration of verbalizing methods.
  • Figure 5: Illustration of latent information methods.
  • ...and 5 more figures