Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review

Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, Qingsong Wei

Abstract

Large Language Models (LLMs) have been transformative across many domains. However, hallucination, i.e., confidently outputting incorrect information, remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark to enable insightful comparison among existing solutions. In this work, we fill this gap via a systematic survey of representative prior works on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods; the results support the main findings of our review. Finally, we provide outlooks for key future directions and outline open challenges. To the best of our knowledge, this survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.

Paper Structure

This paper contains 22 sections, 2 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: A classification of UQ and calibration methods for LLMs. Abbr.: Randomized Utility-driven Synthesis of Uncertain REgions (R-U-SURE), sampling with perturbation for UQ (SPUQ), Uncertainty-Aware Beam Search (UABS), Instruction Tuning (IT), Uncertainty-Aware Language Agent (UALA), Hybrid uncertainty quantification (HUQ), Self-Correcting with Tool-Interactive Critiquing (CRITIC), Human Distribution Calibration Error (DistCE), Length-normalized predictive entropy (LNPE), Predictive entropy (PE), Clustering and Pruning for Efficient Black-box Prompt Search (CLAPS), Iterative Grouped Histogram Binning (IGHB), Sequence Likelihood Calibration (SLiC), Refusal-Aware Instruction Tuning (R-tuning), Listener-Aware Calibration for Implicit and Explicit confidence (LACIE), Statement Accuracy Prediction, based on Language Model Activations (SAPLMA), Self-consistency (SC), Universal self-consistency (USC), Label-smoothing (LS), Fact-and-Reflection (FaR), Uncertainty-aware Instruction Tuning (UAIT), Uncertainty Tripartite Testing Paradigm (Unc-TTP), Semantically Diverse Language Generation (SDLG), Kernel Language Entropy (KLE), Semantic Entropy (SE), Semantic Density (SD), Local Intrinsic Dimensions (LID), Multi-agent debate (MAD), Word-Sequence Entropy (WSE), Distance Dependent Chinese Restaurant Process (DDCRP), Consistent-and-Inconsistent (CAI) Ratio.
  • Figure 2: Reliability diagrams for calibration methods with 10 bins on TriviaQA. Top row: open-box models (Llama 3.1 8B, Qwen2.5 7B, Mistral v0.3 7B). Bottom row: closed-box models (GPT-4, GPT-4o, GPT-5.2, DeepSeek-R1). The color and the percentage number on each bar indicate the proportion of total points contained in each bin.
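
To make the quantities in the Figure 2 caption concrete, the following is a minimal sketch of how the bars of a 10-bin reliability diagram could be computed from per-answer confidences and correctness labels. It assumes equal-width bins over [0, 1]; the function name `reliability_bins` and the input format are hypothetical and are not taken from the paper's benchmark code.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> List[Tuple[float, float, float]]:
    """Return (mean confidence, accuracy, proportion of points) per bin.

    Assumption: equal-width bins over [0, 1]; a confidence of exactly 1.0
    is placed in the last bin. Empty bins are reported as (0.0, 0.0, 0.0).
    """
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins

    for conf, is_correct in zip(confidences, correct):
        # Map the confidence to a bin index, clamping 1.0 into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += int(is_correct)

    total = len(confidences)
    rows = []
    for k in range(n_bins):
        if counts[k] == 0:
            rows.append((0.0, 0.0, 0.0))
        else:
            rows.append(
                (
                    conf_sums[k] / counts[k],  # mean confidence in the bin
                    hit_sums[k] / counts[k],   # empirical accuracy in the bin
                    counts[k] / total,         # share of all points in the bin
                )
            )
    return rows


# Toy data, invented purely for illustration.
rows = reliability_bins([0.05, 0.95, 0.95, 0.55], [False, True, False, True])
```

In a diagram of this kind, the bar height for each bin is its accuracy, the gap between accuracy and mean confidence shows miscalibration, and the third value corresponds to the proportion that the caption says is shown as a color and percentage on each bar.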