A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models
Mingruo Yuan, Shuyi Zhang, Ben Kao
TL;DR
CRUX addresses the challenge of trustworthy confidence estimation in context-dependent QA by introducing two complementary metrics: contextual information gain, measured as the entropy reduction when context is provided, and unified consistency, which gauges output stability across answers generated with and without context. A neural fusion module combines these signals into a final confidence score, enabling explicit grounding in the provided context while distinguishing model uncertainty from data uncertainty. Across five datasets (CoQA, SQuAD, QuAC, BioASQ, EduQG) and two LLMs (Llama-8B, Qwen-14B), CRUX achieves state-of-the-art AUROC, with ablations demonstrating the critical roles of clustering and global consistency. The framework advances practical reliability for context-aware NLG systems and suggests future work on integrating retrieval, per-claim confidence, and error localization to further enhance interpretability and deployment in safety-critical domains.
Abstract
Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers users to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance of responses to contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty as the information gain obtained through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX's effectiveness, achieving higher AUROC than existing baselines.
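To make the two metrics concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that sampled answers are grouped into clusters by normalized exact string match (the actual clustering procedure is not specified here), that entropy is the Shannon entropy of the empirical cluster distribution, and that consistency is the share of all pooled samples falling in the largest cluster. The function names `cluster_entropy`, `entropy_reduction`, and `unified_consistency` are hypothetical.

```python
import math
from collections import Counter


def normalize(answer):
    # Assumption: answers match if equal after lowercasing and trimming.
    return answer.strip().lower()


def cluster_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answer clusters."""
    counts = Counter(normalize(a) for a in answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_reduction(answers_without_context, answers_with_context):
    """Information gain from context: entropy without context minus entropy with it."""
    return cluster_entropy(answers_without_context) - cluster_entropy(answers_with_context)


def unified_consistency(answers_without_context, answers_with_context):
    """Fraction of all samples, pooled over both conditions, in the largest cluster."""
    pooled = [normalize(a) for a in answers_without_context + answers_with_context]
    largest_cluster_size = Counter(pooled).most_common(1)[0][1]
    return largest_cluster_size / len(pooled)


# Toy samples: the model is unsure without context and stable with it.
without_ctx = ["Paris", "Lyon", "paris", "Nice"]
with_ctx = ["Paris", "Paris", "paris", "Paris"]

gain = entropy_reduction(without_ctx, with_ctx)
consistency = unified_consistency(without_ctx, with_ctx)
print(f"entropy reduction: {gain:.4f}, unified consistency: {consistency:.2f}")
```

In this toy case, a large entropy reduction suggests that the context resolved the ambiguity, while the pooled consistency indicates how strongly the answers agree across both conditions. A learned fusion step that combines the two signals into a single confidence score is outside the scope of this sketch.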
