On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs

Lucie Kunitomo-Jacquin, Edison Marrese-Taylor, Ken Fukuda

TL;DR

The paper tackles the challenge of uncertainty quantification for large language models by highlighting that standard entropy-based metrics miss the probability mass of unobserved sequences. It introduces the unobserved-probability concept, $\mathbb{P}(\bar{A}|x)$, and presents two practical variants, EOS-UP and LN-UP, to incorporate missing mass into UQ computed from sampled outputs. Through experiments on Falcon-40B-Instruct with TriviaQA, EOS-UP achieves AUROC performance comparable to predictive entropy and demonstrates robustness when the number of samples $M$ is small, while LN-UP underperforms. The work suggests integrating unobserved probability into existing UQ frameworks, potentially via evidential theories, to more comprehensively capture epistemic and aleatoric uncertainty in LLM outputs.

Abstract

Quantifying uncertainty in large language models (LLMs) is important for safety-critical applications because it helps spot incorrect answers, known as hallucinations. One major trend of uncertainty quantification methods is based on estimating the entropy of the distribution of the LLM's potential output sequences. This estimation is based on a set of output sequences and associated probabilities obtained by querying the LLM several times. In this paper, we advocate and experimentally show that the probability of unobserved sequences plays a crucial role, and we recommend future research to integrate it to enhance such LLM uncertainty quantification methods.
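The idea of missing probability mass can be illustrated with a minimal sketch. It assumes each of the $M$ queries returns a sequence together with its full-sequence probability, and it takes $\mathbb{P}(\bar{A}|x)$ to be one minus the summed probability of the distinct observed sequences. The function names and the toy data are hypothetical, and the Monte Carlo entropy shown is a common baseline estimator assumed here for comparison. The EOS-UP and LN-UP variants are not specified in this summary and are not reproduced.

```python
import math


def unobserved_probability(samples):
    """Probability mass of sequences that never appeared among the samples.

    `samples` is a list of (sequence, probability) pairs, one per query.
    Assumption: a repeated sequence always carries the same probability,
    so each distinct sequence is counted once.
    """
    distinct = {}
    for sequence, probability in samples:
        distinct[sequence] = probability
    observed_mass = sum(distinct.values())
    # Clamp to [0, 1] to absorb floating-point rounding.
    return min(1.0, max(0.0, 1.0 - observed_mass))


def monte_carlo_entropy(samples):
    """Average negative log-probability over the M samples.

    Assumption: this is a generic sample-based entropy estimate, not
    necessarily the exact estimator used in the paper.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    return -sum(math.log(p) for _, p in samples) / len(samples)


if __name__ == "__main__":
    # Toy data (hypothetical): three queries, two distinct answers.
    toy = [("Paris", 0.6), ("Paris", 0.6), ("Lyon", 0.1)]
    print("unobserved mass:", unobserved_probability(toy))
    print("entropy estimate:", monte_carlo_entropy(toy))
```

In this toy case the two distinct answers cover about 0.7 of the probability mass, so roughly 0.3 belongs to sequences that were never sampled. An entropy estimate computed from the samples alone does not account for that remainder, which is the gap the paper draws attention to.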

Paper Structure

This paper contains 10 sections, 5 equations, 3 figures.

Figures (3)

  • Figure 1: Example of tree of possible sequences with token conditional probabilities.
  • Figure 2: Influence of the number of samples ($M$) for the LLM uncertainty quantification in terms of AUROC, for the short (top) and normal (bottom) answer length scenarios. We compare the performance of our proposed approach variations (UP) against relevant baselines. Results were computed on $500$ pairs of questions and ground truth answers on the falcon-40b-instruct model.
  • Figure 3: Prompts fed to the model in our experiments when providing a single (top) and many correct answers (bottom), where placeholders are denoted in bold.