Atomic Calibration of LLMs in Long-Form Generations

Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier

TL;DR

This work defines atomic calibration as fine-grained confidence alignment at the level of individual factual claims within long-form outputs, revealing that traditional response-level calibration often hides atomic-level miscalibrations. It categorizes confidence elicitation into generative and discriminative approaches and introduces two fusion strategies that exploit agreement between methods to improve calibration. Through experiments on three long-form QA datasets with seven LLMs, the authors show that atomic calibration is harder than macro calibration, yet atomic-level signals can boost overall factuality and enable downstream utilities like selective QA and atomic reunion. The findings highlight the need for fine-grained confidence estimation in long-form generation and provide practical guidance for designing robust calibration methods and fusion strategies across model sizes and architectures.

Abstract

Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, as an effective indicator of hallucination, is thus essential to enhance the trustworthiness of LLMs. Prior work mainly focuses on short-form tasks using a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims. In this work, we systematically study atomic calibration, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation. More importantly, atomic calibration uncovers insightful patterns regarding the alignment of confidence methods and the changes of confidence throughout generation. This sheds light on future research directions for confidence estimation in long-form generation.
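
To make the contrast between macro and atomic calibration concrete, here is a minimal sketch that scores the same toy data at both levels using ECE and the Brier score, the two metrics named in Figure 2. Everything in it is an illustrative assumption rather than the authors' implementation: the toy confidences and labels are invented, the response-level confidence is taken as the mean of the atomic confidences, the response-level factuality as the fraction of supported claims, and the function names `brier_score`, `expected_calibration_error`, and `to_response_level` are hypothetical.

```python
from typing import List, Sequence, Tuple

Claim = Tuple[float, float]  # (confidence in [0, 1], factuality label in [0, 1])


def brier_score(conf: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared gap between confidence and factuality."""
    return sum((c - y) ** 2 for c, y in zip(conf, labels)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], labels: Sequence[float], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted |mean confidence - mean factuality|."""
    bins: List[List[Claim]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_fact = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(avg_conf - avg_fact)
    return ece


def to_response_level(responses: Sequence[Sequence[Claim]]) -> Tuple[List[float], List[float]]:
    """Assumed aggregation: mean atomic confidence and fraction of supported claims."""
    conf = [sum(c for c, _ in r) / len(r) for r in responses]
    fact = [sum(y for _, y in r) / len(r) for r in responses]
    return conf, fact


# Invented toy data: each response is a list of (confidence, label) atomic claims.
# In both responses the model is confident about one claim and unsure about the
# other, but the confidences do not match which claim is actually supported.
responses = [
    [(0.9, 0.0), (0.1, 1.0)],
    [(0.8, 1.0), (0.2, 0.0)],
]

atomic_conf = [c for r in responses for c, _ in r]
atomic_fact = [y for r in responses for _, y in r]
resp_conf, resp_fact = to_response_level(responses)

atomic_ece = expected_calibration_error(atomic_conf, atomic_fact)
atomic_brier = brier_score(atomic_conf, atomic_fact)
resp_ece = expected_calibration_error(resp_conf, resp_fact)
resp_brier = brier_score(resp_conf, resp_fact)

print(f"atomic   ECE={atomic_ece:.3f}  Brier={atomic_brier:.3f}")
print(f"response ECE={resp_ece:.3f}  Brier={resp_brier:.3f}")
```

In this toy setting the response-level scores look well calibrated because per-claim errors average out within each response, while the atomic-level scores expose them. This is the kind of masking effect the abstract describes, though the paper's exact aggregation and binning choices may differ from the assumptions used here.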

Paper Structure

This paper contains 34 sections, 9 equations, 11 figures, 15 tables.

Figures (11)

  • Figure 1: Comparison between traditional response-level macro calibration and our atomic calibration. The Fact. label is assigned by a fact-checking module. Only three atomic claims are listed for illustration.
  • Figure 2: Comparison of atomic-level and response-level calibration in terms of ECE and Brier Score. Atomic-level performance is generally worse than response-level performance, with data points consistently lying above the identity line.
  • Figure 3: Heatmaps of the Spearman correlation between different confidence scores for Llama3-8B-Instruct on WildHallu. Warmer colors indicate higher correlations. Atomic level: left; response level: right.
  • Figure 4: Average confidence scores across different parts of long-form responses. For discriminative methods, confidence decreases as the generation progresses, while generative methods show the lowest confidence in the middle sections.
  • Figure 5: Average answer length (in words) for different models on Bios, LongFact, and WildHallu.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Definition 1: Macro Calibration on Factuality
  • Definition 2: Atomic Calibration on Factuality