Graph-based Uncertainty Metrics for Long-form Language Model Outputs

Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, Tatsunori Hashimoto

TL;DR

This paper proposes Graph Uncertainty, which represents the relationship between LLM generations and the claims within them as a bipartite graph and estimates claim-level uncertainty with a family of graph centrality metrics.

Abstract

Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty, which represents the relationship between LLM generations and the claims within them as a bipartite graph and estimates claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on self-consistency can be seen as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains in claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and the uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics yield an average relative gain of 6.8% in AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.

Paper Structure

This paper contains 46 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Graph Uncertainty for claim-level uncertainty estimation. We first sample several responses from LLMs (a) and decompose each response into atomic claims (b), following the graph construction section of the method. The key components are the construction of a bipartite graph that captures the relations between responses and claims (c) and the use of graph centrality metrics to estimate the uncertainty of each claim. We simplify the pipeline and prompt for presentation; see the prompt details appendix for the full versions.
  • Figure 2: Our uncertainty-aware decoding framework. Based on the claim-wise uncertainty estimates obtained as in Figure 1, we keep low-uncertainty claims above a certain confidence threshold and use LLMs to synthesize them into a coherent response. Varying the threshold enables us to balance factuality and informativeness.
  • Figure 3: UAD with better claim-level uncertainty estimates demonstrates a better trade-off between factuality and informativeness of the generated responses. We compare UAD across different thresholds $\delta$ and two non-uncertainty decoding baselines. We assume that random noise is applied to break ties for each uncertainty method, resulting in a horizontal line extending from the leftmost dot to the left. The shaded confidence interval is obtained by bootstrapping with a confidence level of 95%.
  • Figure 4: Ablation study. (a) False claims have a greater average distance to other claims than true claims do, indicating the effectiveness of the closeness centrality metric. (b) Performance improves consistently as we increase the number of responses $|\mathcal{R}_N|$ used to construct the claim node set $\mathcal{C}$ in our uncertainty estimation method. While all evaluations are conducted on the same fixed set of claims, varying $|\mathcal{R}_N|$ alters the graph structure used to estimate these claims' uncertainty values.
  • Figure 5: UAD results with PH-VC included for breaking ties in the SC scores. The plot shows the trade-off between factuality and informativeness for various UAD variants and baselines.
  • ...and 2 more figures
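
The thresholding step described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the confidence scores are made up, the inclusive comparison against the threshold `delta` is an assumption, and `synthesize` is a hypothetical stand-in for the LLM call that merges the kept claims into a coherent response.

```python
def select_claims(confidence, delta):
    """Keep claims whose confidence is at least delta, most confident first.

    `confidence` maps claim text to a score in [0, 1]; ties are broken
    alphabetically here purely to keep the output deterministic.
    """
    kept = [(claim, score) for claim, score in confidence.items() if score >= delta]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [claim for claim, _ in kept]


def synthesize(claims):
    """Hypothetical placeholder for the LLM synthesis step: joins the claims."""
    return " ".join(claims)


# Made-up confidence scores for three claims.
confidence = {
    "Claim A.": 0.9,
    "Claim B.": 0.5,
    "Claim C.": 0.2,
}

strict = select_claims(confidence, delta=0.8)      # fewer claims kept
lenient = select_claims(confidence, delta=0.4)     # more claims kept
response = synthesize(lenient)
```

Raising `delta` keeps fewer claims, and lowering it keeps more, which is the mechanism the caption describes for balancing factuality against informativeness.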