A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z. Ren, Anirudha Majumdar

TL;DR

This survey addresses the reliability challenge of LLMs by organizing uncertainty quantification methods into four categories tailored to LLMs: token-level, self-verbalized, semantic-similarity, and mechanistic-interpretability methods. It surveys architectures, NLI-based techniques, calibration strategies, and both white-box and black-box metrics, linking uncertainty to factuality and hallucination detection. It also covers datasets, benchmarks, and diverse applications, including robotics and interactive agents, and outlines open research challenges and future directions. The work aims to facilitate safer, more trustworthy deployment of LLMs by providing a cohesive framework and actionable guidance for researchers and practitioners.

Abstract

The remarkable performance of large language models (LLMs) in content generation, coding, and common-sense reasoning has spurred widespread integration into many facets of society. However, integration of LLMs raises valid questions about their reliability and trustworthiness, given their propensity to generate hallucinations: plausible but factually incorrect responses that are expressed with striking confidence. Previous work has shown that hallucinations and other non-factual responses generated by LLMs can be detected by examining the uncertainty of the LLM in its response to the pertinent prompt, driving significant research efforts devoted to quantifying the uncertainty of LLMs. This survey seeks to provide an extensive review of existing uncertainty quantification methods for LLMs, identifying their salient features, along with their strengths and weaknesses. We present existing methods within a relevant taxonomy, unifying ostensibly disparate methods to aid understanding of the state of the art. Furthermore, we highlight applications of uncertainty quantification methods for LLMs, ranging from chatbot and textual applications to embodied artificial intelligence applications in robotics. We conclude with open research challenges in uncertainty quantification of LLMs, seeking to motivate future research.

Paper Structure

This paper contains 36 sections, 1 equation, and 18 figures.

Figures (18)

  • Figure 1: A user asks an LLM the question "What is the lowest-ever temperature recorded in Antarctica?" and the LLM answers definitively. The user then asks the LLM how confident it is. Although the LLM states that it is "100% confident," its response fails a fact check. Confidence scores provided by LLMs are generally miscalibrated. UQ methods seek to provide calibrated estimates of the confidence of LLMs in their interaction with users.
  • Figure 3: Hallucination in LLMs: When asked for information about a possibly fictional person, LLMs tend to fabricate a response that sounds coherent but is entirely false.
  • Figure 4: Hallucination in LLMs: When asked about its confidence, the LLM apologizes before hallucinating another response. The Jewish Cookbook is authored by Leah Koenig, not Jaime Feldman.
  • Figure 5: Uncertainty quantification methods in deep learning span the spectrum from training-based methods to training-free methods.
  • Figure 6: Many state-of-the-art LLMs are decoder-only transformers, with $N$ multi-head attention sub-blocks, for auto-regressive output generation.
  • ...and 13 more figures