Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models

Jinhao Duan; Hao Cheng; Shiqi Wang; Alex Zavalny; Chenan Wang; Renjing Xu; Bhavya Kailkhura; Kaidi Xu

TL;DR

The paper identifies generative inequalities in free-form LLMs, where many tokens and sentences convey limited semantics yet disproportionately influence uncertainty estimates. It introduces Shifting Attention to Relevance (SAR), a mechanism that reweights token- and sentence-level contributions by their semantic relevance to improve uncertainty quantification. Through extensive experiments on multiple off-the-shelf and instruction-tuned LLMs across diverse QA tasks, SAR and its variants (tokenSAR, sentSAR) consistently outperform prior baselines, including Semantic Entropy, with notable gains in AUROC. The work demonstrates SAR’s potential to enhance reliability and trust in LLM outputs, especially in high-stakes domains like medicine, while acknowledging computational and accessibility trade-offs.

Abstract

Large Language Models (LLMs) show promising results in language generation and instruction following but frequently "hallucinate", making their outputs less reliable. Although Uncertainty Quantification (UQ) is a potential solution, implementing it accurately within LLMs is challenging. Our research introduces a simple heuristic: not all tokens in auto-regressive LLM text equally represent the underlying meaning, as "linguistic redundancy" often allows a few keywords to convey the essence of long sentences. However, current methods underestimate this inequality when assessing uncertainty, causing tokens with limited semantics to be equally or excessively weighted in UQ. To correct this, we propose Shifting Attention to more Relevant (SAR) components at both the token and sentence levels for better UQ. We conduct extensive experiments involving a range of popular "off-the-shelf" LLMs, such as Vicuna, WizardLM, and LLaMA-2-chat, with model sizes extending up to 33B parameters. We evaluate various free-form question-answering tasks, encompassing domains such as reading comprehension, science Q&A, and medical Q&A. Our experimental results, coupled with a comprehensive demographic analysis, demonstrate the superior performance of SAR. The code is available at https://github.com/jinhaoduan/SAR.
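
The token-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes each generated token comes with a log-probability and a non-negative relevance score, and it weights each token's negative log-likelihood by its normalized relevance. The function names and the relevance values are hypothetical; this page does not specify how relevance is scored, so the scores below are hand-picked placeholders.

```python
def mean_token_uncertainty(token_logprobs):
    """Baseline: every token contributes equally to the uncertainty."""
    return -sum(token_logprobs) / len(token_logprobs)


def relevance_weighted_uncertainty(token_logprobs, relevance):
    """Weight each token's negative log-likelihood by its normalized relevance.

    `relevance` holds hypothetical non-negative scores, one per token.
    """
    if len(token_logprobs) != len(relevance):
        raise ValueError("need one relevance score per token")
    total = sum(relevance)
    if total <= 0:
        raise ValueError("relevance scores must sum to a positive value")
    weights = [r / total for r in relevance]
    return sum(-lp * w for lp, w in zip(token_logprobs, weights))


# Hypothetical three-token answer: a confident keyword, then an uncertain
# function word (such as "of"), then another fairly confident token.
logprobs = [-0.1, -3.0, -0.2]
relevance = [0.9, 0.05, 0.05]  # placeholder scores, not model outputs

baseline = mean_token_uncertainty(logprobs)
shifted = relevance_weighted_uncertainty(logprobs, relevance)
```

With these placeholder values the low-relevance token dominates the equal-weight baseline, while the relevance-weighted score is driven mostly by the keyword.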

Paper Structure (32 sections, 11 equations, 8 figures, 7 tables)

Introduction
Related Works
Uncertainty Quantification in Conventional NLP Tasks.
Uncertainty Quantification in LLMs.
Generative Inequality in Uncertainty Quantification
Preliminaries
Token-Level Generative Inequality
Sentence-Level Generative Inequality
Analytical Insights
Shifting Attention to Relevance
Notations
Relevance Discovery and Shifting
Overall Measurement
Empirical Evaluations
Experimental Settings
...and 17 more sections

Figures (8)

Figure 1: Irrelevant tokens (or sentences) may account for the majority of the uncertainty in free-form generations. For example, the token "of" carries extremely large uncertainty, which misleads the uncertainty quantification of LLMs. We term these observations generative inequalities and tackle them by shifting attention to more relevant components.
Figure 2: Distributions of relevance scores at the token level and the sentence level. Irrelevant tokens and sentences make up a considerable proportion.
Figure 3: Correlations between relevance scores and uncertainty proportions at the token level and the sentence level. Irrelevant tokens and sentences account for most of the total uncertainty.
Figure 4: The AUROCs of tokenSAR, sentSAR, SAR, and baseline methods across various "off-the-shelf" LLMs and datasets (e.g., CoQA and Trivia QA). Rouge-L with a threshold of 0.5 is used as the correctness metric. The proposed SAR substantially outperforms existing methods across all scenarios.
Figure 5: The performance of SAR and baseline methods over various numbers of generations.
...and 3 more figures
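
The evaluation named in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: it assumes an answer counts as correct when its Rouge-L score is strictly above the 0.5 threshold, and that AUROC measures how often an incorrect answer receives higher uncertainty than a correct one. The helper names and the sample scores are hypothetical.

```python
def label_correct(rouge_l_scores, threshold=0.5):
    """Mark an answer correct when its Rouge-L score exceeds the threshold.

    Using a strict '>' is an assumption; the caption only names the threshold.
    """
    return [score > threshold for score in rouge_l_scores]


def auroc(uncertainty, is_correct):
    """Pairwise AUROC: the fraction of (incorrect, correct) answer pairs in
    which the incorrect answer has the higher uncertainty. Ties count half."""
    wrong = [u for u, c in zip(uncertainty, is_correct) if not c]
    right = [u for u, c in zip(uncertainty, is_correct) if c]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))


# Hypothetical Rouge-L scores and uncertainty estimates for four answers.
rouge_l = [0.9, 0.2, 0.7, 0.1]
uncertainty = [0.2, 0.8, 0.7, 0.6]

labels = label_correct(rouge_l)
score = auroc(uncertainty, labels)
```

An AUROC of 0.5 corresponds to uncertainty scores that carry no information about correctness, and 1.0 to scores that rank every incorrect answer above every correct one.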
