A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models

Mingruo Yuan, Shuyi Zhang, Ben Kao

TL;DR

CRUX addresses the challenge of trustworthy confidence estimation in context-dependent QA by introducing two complementary metrics: contextual information gain, measured as the entropy reduction when context is present, and unified consistency, which gauges output stability across context-conditioned and context-free generation. A neural fusion module combines these signals into a final confidence score, enabling explicit grounding in the provided context while distinguishing model uncertainty from data uncertainty. Across five datasets (CoQA, SQuAD, QuAC, BioASQ, EduQG) and two LLMs (Llama-8B, Qwen-14B), CRUX achieves state-of-the-art AUROC, with ablations demonstrating the critical roles of clustering and global consistency. The framework advances practical reliability for context-aware NLG systems and suggests future work in integrating retrieval, per-claim confidence, and error localization to further enhance interpretability and deployment in safety-critical domains.

Abstract

Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers users to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty as the information gain measured through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX's effectiveness, achieving higher AUROC than existing baselines.
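
The two signals named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the exact formulas are not given in this section, so the sketch assumes that answers are grouped by normalized exact match (a stand-in for whatever clustering CRUX uses), that entropy reduction is the entropy without context minus the entropy with context, and that consistency is the share of all pooled samples in the largest cluster. The function names `cluster_entropy`, `entropy_reduction`, and `unified_consistency` are hypothetical.

```python
import math
from collections import Counter


def _normalize(answer):
    # Assumption: exact match after lowercasing stands in for semantic clustering.
    return answer.strip().lower()


def cluster_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answer clusters."""
    counts = Counter(_normalize(a) for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_reduction(with_ctx, without_ctx):
    """Assumed form of the information gain: H(no context) - H(with context)."""
    return cluster_entropy(without_ctx) - cluster_entropy(with_ctx)


def unified_consistency(with_ctx, without_ctx):
    """Assumed proxy: fraction of all pooled samples in the largest cluster."""
    pooled = [_normalize(a) for a in list(with_ctx) + list(without_ctx)]
    return Counter(pooled).most_common(1)[0][1] / len(pooled)


if __name__ == "__main__":
    # Hypothetical samples drawn with and without the supporting passage.
    sampled_with = ["Paris", "Paris", "Paris", "Paris"]
    sampled_without = ["Paris", "Lyon", "Paris", "Nice"]
    print(entropy_reduction(sampled_with, sampled_without))
    print(unified_consistency(sampled_with, sampled_without))
```

In this toy case the context collapses the answers onto a single cluster, so the entropy reduction is positive, while the pooled consistency stays below 1 because the context-free samples disagree. How CRUX fuses the two signals into one score is not sketched here.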

Paper Structure

This paper contains 25 sections, 6 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of Traditional Consistency vs. CRUX Methodology: (a) Conventional approaches focus exclusively on response consistency. (b) The CRUX framework enhances evaluation by combining contextual faithfulness (assessed via contextual information gain) with global consistency.
  • Figure 2: AUROC Curves for CoQA under Llama-8B
  • Figure 3: AUROC Curves for SQuAD under Qwen-14B
  • Figure 4: Case Study. The left panel (Case 1) demonstrates high-quality responses where context resolves confusion (label=1), while the right panel (Case 2) shows hallucination-prone answers where responses fail to answer the question (label=0).
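
Since AUROC is the headline metric and Figure 4 uses binary labels (1 for a good response, 0 for a hallucination-prone one), a small sketch of how AUROC scores a confidence estimator may help. It uses the standard pairwise-ranking definition: the probability that a randomly chosen correct answer receives a higher confidence than a randomly chosen incorrect one, with ties counted as one half. The function name `auroc` and the sample values are illustrative, not taken from the paper.

```python
def auroc(confidences, labels):
    """AUROC via pairwise comparison of correct (1) and incorrect (0) answers."""
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    # A pair counts 1 if the correct answer is ranked higher, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical confidence scores and correctness labels.
    scores = [0.9, 0.2, 0.8, 0.1]
    labels = [1, 1, 0, 0]
    print(auroc(scores, labels))
```

A value of 0.5 corresponds to a confidence score that ranks correct and incorrect answers no better than chance, and 1.0 to a perfect ranking. The quadratic pairwise loop is chosen for clarity and is fine for small evaluation sets.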