LUQ: Long-text Uncertainty Quantification for LLMs

Caiqi Zhang, Fangyu Liu, Marco Basaldella, Nigel Collier

TL;DR

This work tackles uncertainty quantification for long-form LLM outputs, identifying limitations of existing UQ methods when handling extended text. It introduces Luq, a sentence-level, sampling-based uncertainty metric, along with Luq-Atomic and Luq-Pair variants, and demonstrates strong negative correlations with factuality across six LLMs on the FActScore datasets. It further proposes Luq-Ensemble and selective question answering to boost factuality, showing meaningful improvements in real-world, long-form scenarios. The study provides practical tools to enhance reliability of AI-generated long-form content and lays groundwork for future long-text UQ research and domain-specific datasets.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capability in a variety of NLP tasks. However, LLMs are also prone to generating nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generation, thereby aiding in the mitigation of nonfactual outputs. Existing research on UQ predominantly targets short text generation, typically yielding brief, word-limited responses. However, real-world applications frequently necessitate much longer responses. Our study first highlights the limitations of current UQ methods in handling long text generation. We then introduce Luq and its two variations, a series of novel sampling-based UQ approaches specifically designed for long text. Our findings reveal that Luq outperforms existing baseline methods in correlating with the model's factuality scores (a correlation coefficient of -0.85 observed for Gemini Pro). To further improve the factuality of LLM responses, we propose Luq-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty. The ensembling method greatly improves response factuality over the best standalone LLM.
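
The sample-then-select idea described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how agreement between samples is scored, so the `support` function below uses plain token overlap as a hypothetical stand-in, and every function name and toy sample is invented for illustration.

```python
import re
from typing import Dict, List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokens(text: str) -> set:
    # Lowercased alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support(sentence: str, reference: str) -> float:
    # ASSUMPTION: token overlap is a placeholder consistency score,
    # not the scoring function used in the paper.
    sent_tokens = tokens(sentence)
    if not sent_tokens:
        return 0.0
    return len(sent_tokens & tokens(reference)) / len(sent_tokens)


def uncertainty(samples: List[str]) -> float:
    # For each sample, average the sentence-level support it receives from
    # every other sample, then report 1 - support averaged over all samples.
    if len(samples) < 2:
        raise ValueError("need at least two sampled responses")
    per_sample = []
    for i, response in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        sentences = split_sentences(response)
        if not sentences:
            per_sample.append(1.0)  # an empty response is treated as maximally uncertain
            continue
        sims = [
            sum(support(s, other) for s in sentences) / len(sentences)
            for other in others
        ]
        per_sample.append(1.0 - sum(sims) / len(sims))
    return sum(per_sample) / len(per_sample)


def select_lowest_uncertainty(
    samples_by_model: Dict[str, List[str]]
) -> Tuple[str, Dict[str, float]]:
    # Ensemble step: score each model's samples and pick the least uncertain model.
    scores = {name: uncertainty(s) for name, s in samples_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    toy = {  # hypothetical sampled responses, for illustration only
        "model_a": ["Paris is the capital. It is in France."] * 3,
        "model_b": [
            "He was born in 1900 in Rome.",
            "She was born in 1850 in Oslo.",
            "They died in 1970 in Lima.",
        ],
    }
    best, scores = select_lowest_uncertainty(toy)
    print(best, scores)
```

In this sketch a model whose samples agree with one another receives a low score, and the ensemble step simply returns the model with the minimum score. Swapping `support` for a stronger consistency scorer would leave the rest of the structure unchanged.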

Paper Structure

This paper contains 31 sections, 8 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The illustration of the Luq and Luq-Ensemble framework. Given a question, various LLMs exhibit differing levels of uncertainty. We generate $n$ sample responses from each LLM and then assess the uncertainty based on the diversity of these samples (the Luq metric). Green highlights indicate consistency across responses (low uncertainty) and red highlights discrepancies (high uncertainty). The Luq-Ensemble method selects the response from the LLM with the lowest uncertainty score as the final answer.
  • Figure 2: Scatter plot illustrating the relationship between factuality scores (x-axis) and uncertainty scores (y-axis) for different LLMs. Each point represents an item in the FActScore dataset, with a red line highlighting the Pearson correlation. The distribution suggests a pattern where higher factuality correlates with lower uncertainty.
  • Figure 3: Factuality and uncertainty scores across different frequencies on FActScore-Bio.
  • Figure 4: Average number of atomic facts (AF) in a response for each model.
  • Figure 5: The effect of different temperatures (upper) and the number of samples (lower) on the PCC with Luq.
  • ...and 2 more figures
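
The Pearson correlation coefficient (PCC) referred to in the Figure 2 and Figure 5 captions can be computed as in the sketch below. The function name and the toy scores are hypothetical and chosen only to show the expected sign: when uncertainty falls as factuality rises, the coefficient is negative.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Pearson correlation: covariance divided by the product of standard deviations.
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-item scores, not data from the paper.
    factuality = [0.2, 0.4, 0.6, 0.8]
    uncertainty_scores = [0.8, 0.6, 0.4, 0.2]
    print(pearson(factuality, uncertainty_scores))
```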