Long-Form Information Alignment Evaluation Beyond Atomic Facts

Danna Zheng, Mirella Lapata, Jeff Z. Pan

TL;DR

This work reveals a vulnerability in long-form information alignment evaluation: even when all atomic facts are true, rearranging them can mislead readers. It introduces MontageLie, a benchmark that constructs montage-style lies by reordering truthful statements to subtly alter causal narratives, and shows that both coarse- and fine-grained evaluators struggle with this attack. To address this, the authors propose DoveScore, a fine-grained framework that jointly verifies atomic facts and event ordering and outperforms existing fine-grained methods by over 8%. The results underscore the need for order-aware evaluation of long-form text and offer a modular framework for building more robust alignment evaluators, with practical implications for trustworthy NLG and LLM deployment.

Abstract

Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.
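
The vulnerability described above can be illustrated with a toy sketch. It is not the DoveScore implementation: exact string matching stands in for an LLM-based fact checker, each list of facts is assumed to be given in the event order its text implies, facts are assumed to be unique, and combining the two signals by multiplication is an arbitrary choice made only for illustration. All function names are hypothetical.

```python
from itertools import combinations


def fact_precision(source_facts, target_facts):
    """Share of target facts found in the source, ignoring order."""
    if not target_facts:
        return 0.0
    source = set(source_facts)
    return sum(fact in source for fact in target_facts) / len(target_facts)


def order_consistency(source_facts, target_facts):
    """Share of supported fact pairs whose relative order matches the source."""
    position = {fact: i for i, fact in enumerate(source_facts)}
    supported = [fact for fact in target_facts if fact in position]
    pairs = list(combinations(supported, 2))
    if not pairs:
        return 1.0  # nothing to compare, so no ordering violation
    concordant = sum(position[a] < position[b] for a, b in pairs)
    return concordant / len(pairs)


def order_aware_score(source_facts, target_facts):
    """Toy combination (an assumption): precision scaled by order consistency."""
    return fact_precision(source_facts, target_facts) * order_consistency(
        source_facts, target_facts
    )


source = ["alarm rang", "guard left the post", "vault was opened"]
montage = ["vault was opened", "guard left the post", "alarm rang"]

# Every fact in the montage is true, so an order-blind check is fully satisfied.
print(fact_precision(source, montage))
# The order-aware variant penalises the reversed event sequence.
print(order_aware_score(source, montage))
```

In this sketch the reordered text keeps a perfect fact-level score while its order-aware score drops, which is the gap a montage-style lie exploits.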

Paper Structure

This paper contains 49 sections, 5 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Limitation of existing fine-grained evaluators such as FactScore and AlignScore, which struggle to detect lies composed of exactly the same small units that make up the truth.
  • Figure 2: Violin plots of scores from gpt-4o-mini on MontageLie. The similar distributions for original and rephrased targets indicate robustness to rephrasing. Comparable trends are observed for other evaluators (see Appendix \ref{App:ScoreDist}).
  • Figure 3: Illustration of DoveScore, which comprises three core components: the Decomposer, the Fact Checker, and the Sorter.
  • Figure 4: Score distribution comparison of fine-grained evaluators. SummaC exhibits a similar pattern to AlignScore, assigning low scores to both correct and wrong target texts (see Appendix \ref{App:ScoreDist}).
  • Figure 5: Distribution of word lengths in the MontageLie benchmark.
  • ...and 5 more figures