Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks

Kevin Wang, Subre Abdoul Moktar, Jia Li, Kangshuo Li, Feng Chen

TL;DR

The study tackles reliable uncertainty quantification in LLM QA by empirically comparing twelve UE methods across in-distribution (ID) tasks (CoQA and bAbIQA) and an out-of-distribution (OOD) task (ALCUNA). It introduces a multi-metric evaluation framework comprising ROUGE-L, BERTScore, LLMScore, and HumanEval, with LLMScore computed using backbones such as Gemma and Vicuna, and analyzes performance through the PRR criterion and AUCPR. A key finding is that information-based UE methods excel at capturing aleatoric uncertainty in ID settings, while density-based methods and the reflexive P(True) measure better capture epistemic uncertainty in OOD contexts; semantic consistency methods provide generally robust but dataset-dependent performance. The results offer practical guidance for selecting UE methods by uncertainty type and data context, and point to directions for future work, including testing newer models like llama-3-8b-Instruct and broadening OOD benchmarks.

Abstract

Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE) plays a key role. In this work, a comprehensive empirical study is conducted to examine the robustness and effectiveness of diverse UE measures regarding aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE methods and four generation quality metrics including LLMScore from LLM criticizers to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model's understanding of the data. Conversely, density-based methods and the P(True) metric exhibit superior performance in OOD contexts, highlighting their effectiveness in capturing the model's epistemic uncertainty. Semantic consistency methods, which assess variability in generated answers, show reliable performance across different datasets and generation metrics. These methods generally perform well but may not be optimal for every situation.
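To make "information-based methods, which leverage token and sequence probabilities" concrete, the following is a minimal sketch of three commonly used scores of this kind. It is not taken from the paper: the function names, the choice of these three scores, and the toy numbers are illustrative assumptions, and the study's twelve methods may be defined differently.

```python
import math

# Illustrative sketch only: function names and inputs are assumptions,
# not the paper's implementation.

def sequence_nll(token_logprobs):
    """Negative log-likelihood of the generated answer.

    token_logprobs: log-probability the model assigned to each generated token.
    Higher values mean the model found its own answer less likely.
    """
    return -sum(token_logprobs)


def perplexity(token_logprobs):
    """Length-normalized variant: exp of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) of the next-token distribution at each step.

    token_distributions: one probability vector per generated token.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Toy example with made-up probabilities for two short answers.
confident = [math.log(0.9), math.log(0.8)]
hesitant = [math.log(0.5), math.log(0.4)]
print(sequence_nll(confident) < sequence_nll(hesitant))
```

All three scores are read directly from the model's output distribution, so they need no extra sampling; that is one plausible reason such measures track answer quality well when the inputs resemble the model's training data.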

Paper Structure

This paper contains 9 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Ranking of UE Methods on CoQA, using LLMScore with Gemma.
  • Figure 2: Ranking of UE Methods on bAbIQA, using LLMScore with Gemma.
  • Figure 3: Ranking of UE Methods on ALCUNA, using LLMScore with Gemma.
  • Figure 4: An example of P(True) estimating high uncertainty on a correct LLM answer in CoQA. 29.303 is considered a high value among the values collected.
  • Figure 5: An example of P(True) estimating high uncertainty on an incorrect LLM answer in ALCUNA despite LLMScore indicating high correctness.