Fact-Level Confidence Calibration and Self-Correction

Yige Yuan, Bingbing Xu, Hexiang Tan, Fei Sun, Teng Xiao, Wei Li, Huawei Shen, Xueqi Cheng

TL;DR

This paper proposes a Fact-Level Calibration framework that operates at a finer granularity, calibrating confidence to relevance-weighted correctness at the fact level. Analysis under the framework inspired Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones.

Abstract

Confidence calibration in LLMs, i.e., aligning their self-assessed confidence with the actual accuracy of their responses, enables them to self-evaluate the correctness of their outputs. However, current calibration methods for LLMs typically estimate two scalars to represent overall response confidence and correctness, which is inadequate for long-form generation, where a response includes multiple atomic facts and may be only partially confident and partially correct. These methods also overlook the relevance of each fact to the query. To address these challenges, we propose a Fact-Level Calibration framework that operates at a finer granularity, calibrating confidence to relevance-weighted correctness at the fact level. Furthermore, comprehensive analysis under the framework inspired the development of Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones. Extensive experiments across four datasets and six models demonstrate that ConFix effectively mitigates hallucinations without requiring external knowledge sources such as retrieval systems.
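
To make the idea of calibrating confidence to relevance-weighted correctness more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formulas, so several things here are assumptions: the target for each fact is taken to be the product of relevance and correctness, the calibration measure is a standard binned expected calibration error applied per fact, and the names `Fact`, `fact_level_ece`, `split_by_confidence`, and the 0.5 threshold are hypothetical. The split step only illustrates how high-confidence facts could be separated out as candidate context for revising low-confidence ones.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    """One atomic fact extracted from a long-form response (hypothetical schema)."""
    text: str
    confidence: float   # model's self-assessed confidence, in [0, 1]
    correctness: float  # 1.0 if the fact is correct, 0.0 otherwise
    relevance: float    # relevance of the fact to the query, in [0, 1]


def fact_level_ece(facts, n_bins=10):
    """Binned expected calibration error computed over facts.

    Assumption: the calibration target for a fact is relevance * correctness.
    """
    if not facts:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for fact in facts:
        # Clamp so that confidence == 1.0 falls into the last bin.
        idx = min(int(fact.confidence * n_bins), n_bins - 1)
        bins[idx].append(fact)
    total = len(facts)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(f.confidence for f in bucket) / len(bucket)
        avg_target = sum(f.relevance * f.correctness for f in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - avg_target)
    return ece


def split_by_confidence(facts, threshold=0.5):
    """Partition facts into (high-confidence, low-confidence) lists.

    The threshold is an illustrative choice, not a value from the paper.
    """
    high = [f for f in facts if f.confidence >= threshold]
    low = [f for f in facts if f.confidence < threshold]
    return high, low


if __name__ == "__main__":
    facts = [
        Fact("fact A", confidence=0.9, correctness=1.0, relevance=1.0),
        Fact("fact B", confidence=0.2, correctness=0.0, relevance=1.0),
    ]
    print(fact_level_ece(facts))
    high, low = split_by_confidence(facts)
    print([f.text for f in high], [f.text for f in low])
```

In a self-correction loop of the kind the abstract describes, the `high` list would be supplied back to the model as additional knowledge when it revises the facts in `low`; how that prompt is built is outside what this excerpt specifies.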

Paper Structure

This paper contains 48 sections, 7 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Motivation of our fact-level confidence calibration and confidence-guided self-correction.
  • Figure 2: An illustration of our fact-level confidence calibration framework for fine-grained LLM calibration.
  • Figure 3: Comparison of calibration measures at the fact level and the response level across three models of different scales (LLaMA-7B, LLaMA-13B, and GPT-3.5), indicating that fact-level calibration imposes a stricter standard (Observation 1).
  • Figure 4: Comparison of confidence distribution across different responses between fact-level and response-level, with the purple plots representing fact-level distribution under different statistical metrics and the gray plots showing the response-level distribution, highlighting the over-confidence issue at the response level, which stems from the dominance of the implicit high-confidence facts hidden within the response (Observation 2).
  • Figure 5: Confidence distribution within individual responses at the fact level. The red bar marks the response-level confidence score; the spread shows that fact-level confidence varies widely within a single response (Observation 3).
  • ...and 1 more figures