Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering
Yavuz Bakman, Sungmin Kang, Zhiqi Huang, Duygu Nur Yaldiz, Catarina G. Belém, Chenyang Zhu, Anoop Kumar, Alfy Samuel, Salman Avestimehr, Daben Liu, Sai Praneeth Karimireddy
TL;DR
This work tackles epistemic uncertainty quantification (UQ) for contextual QA. It formulates a theoretically grounded total uncertainty measure and bounds epistemic uncertainty by the distance between the actual model and an ideal, perfectly prompted model. It posits that this gap can be captured by a small set of semantic feature directions (context reliance, context comprehension, and honesty), which are extracted with minimal labeled data via a top-down interpretability approach and combined in a sampling-free ensemble. Empirical results on Qasper, HotpotQA, and NarrativeQA across multiple models show state-of-the-art performance against unsupervised and supervised baselines, with notable robustness under distribution shift and in low-data regimes. The framework provides a principled, efficient pathway to reliable UQ in contextual QA, with potential for discovering additional features and extending to broader NLP tasks.
Abstract
Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify epistemic uncertainty. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context reliance (using the provided context rather than parametric knowledge), context comprehension (extracting relevant information from context), and honesty (avoiding intentional lies). Using a top-down interpretability approach, we extract these features from only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point improvement in Prediction Rejection Ratio (PRR) while incurring negligible inference overhead.
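The decomposition mentioned in the abstract can be illustrated with the standard identity that cross-entropy equals the entropy of the reference distribution plus the KL divergence from it to the model: H(p*, p) = H(p*) + KL(p* || p). The sketch below is a minimal numerical check of that identity, not the paper's implementation. It assumes the cross-entropy is taken with the true (or ideal) distribution as the reference, and that the entropy term plays the role of the irreducible part while the KL term plays the role of the epistemic part. The two four-token distributions and all function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p_ref, p_model):
    """Cross-entropy of p_model measured against the reference p_ref."""
    return -sum(pr * math.log(pm) for pr, pm in zip(p_ref, p_model) if pr > 0)


def kl_divergence(p_ref, p_model):
    """KL(p_ref || p_model) for discrete distributions."""
    return sum(pr * math.log(pr / pm) for pr, pm in zip(p_ref, p_model) if pr > 0)


# Hypothetical next-token distributions over a 4-token vocabulary.
# p_ideal stands in for the perfectly prompted, idealized model (assumption).
p_ideal = [0.70, 0.20, 0.05, 0.05]
p_model = [0.40, 0.30, 0.20, 0.10]

total = cross_entropy(p_ideal, p_model)      # total uncertainty (assumed form)
irreducible = entropy(p_ideal)               # entropy of the reference distribution
epistemic = kl_divergence(p_ideal, p_model)  # gap between ideal and actual model

print(f"total={total:.4f} irreducible={irreducible:.4f} epistemic={epistemic:.4f}")
```

Under these assumptions, the epistemic term vanishes exactly when the given model matches the ideal one, which is why the approach reduces to measuring how far the model's behavior is from that of the ideal model.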
