AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation

Haoyi Qiu; Kung-Hsiang Huang; Jingnong Qu; Nanyun Peng

TL;DR

AMRFact introduces a factuality evaluation framework that uses Abstract Meaning Representations (AMRs) to generate coherent, factually inconsistent (negative) summaries with broad error-type coverage. A dedicated NegFilter module validates negative samples by enforcing entailment distinctiveness and source relevance, yielding higher-quality training data. A RoBERTa-based evaluator is trained on AMRFact-generated data to assess factual consistency as an entailment task, achieving state-of-the-art performance on the AggreFact-FtSota benchmark and strong results on CNN/DM and XSum. The work highlights the effectiveness of AMR-guided perturbations for data construction and demonstrates the importance of data quality controls in improving factuality detection for abstractive summarization.
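The two NegFilter criteria named above (entailment distinctiveness and source relevance) can be illustrated with a minimal sketch. It assumes each candidate already carries two precomputed scores: an NLI entailment probability between the reference and the perturbed summary, and a BARTScore of the perturbed summary against the source document. The `Candidate` class, the `neg_filter` function, the direction of each comparison, and both threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    # Assumed: probability that the reference summary entails the perturbed one.
    entailment: float
    # Assumed: BARTScore of the perturbed summary given the source (a log-likelihood, so <= 0).
    bartscore: float


# Hypothetical thresholds chosen for illustration only.
MAX_ENTAILMENT = 0.9
MIN_BARTSCORE = -3.0


def neg_filter(candidates, max_entailment=MAX_ENTAILMENT, min_bartscore=MIN_BARTSCORE):
    """Keep negatives that differ from the reference yet stay relevant to the source."""
    return [
        c for c in candidates
        if c.entailment < max_entailment and c.bartscore > min_bartscore
    ]


# Toy scores: only the first candidate passes both checks.
candidates = [
    Candidate("distinct and on-topic", entailment=0.10, bartscore=-2.0),
    Candidate("still entailed by the reference", entailment=0.95, bartscore=-2.0),
    Candidate("drifted away from the source", entailment=0.10, bartscore=-5.0),
]
kept = neg_filter(candidates)
```

In this sketch a candidate is rejected if either check fails, which mirrors the idea that a useful negative must be both genuinely inconsistent and still about the same article.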

Abstract

Ensuring factual consistency is crucial for natural language generation tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior work on evaluating the factual consistency of summarization often takes an entailment-based approach: it first generates perturbed (factually inconsistent) summaries and then trains a classifier on the generated data to detect factual inconsistencies at test time. However, the perturbed summaries generated by previous approaches either have low coherence or lack error-type coverage. To address these issues, we propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs). Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing coherent, factually inconsistent summaries to be generated with high error-type coverage. Additionally, we present NegFilter, a data selection module based on natural language inference and BARTScore, to ensure the quality of the generated negative samples. Experimental results demonstrate that our approach significantly outperforms previous systems on the AggreFact-SOTA benchmark, showcasing its efficacy in evaluating the factuality of abstractive summarization.
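As a rough illustration of injecting a controlled inconsistency into an AMR graph, the sketch below removes a wrapping concept such as `consider-02` and promotes its `:ARG1` child, which strengthens the modality of the statement. It assumes a simplified tree-shaped AMR stored as nested dictionaries, with no variables or re-entrancy. The `remove_wrapper` helper and the toy sentence are hypothetical and are not the authors' implementation.

```python
def remove_wrapper(node, concept, keep_role=":ARG1"):
    """Return a new graph in which each `concept` node is replaced by its `keep_role` child."""
    if node["concept"] == concept and keep_role in node["edges"]:
        # Drop the wrapper and continue with the promoted child.
        return remove_wrapper(node["edges"][keep_role], concept, keep_role)
    return {
        "concept": node["concept"],
        "edges": {
            role: remove_wrapper(child, concept, keep_role)
            for role, child in node["edges"].items()
        },
    }


def leaf(concept):
    """Build a node with no outgoing edges."""
    return {"concept": concept, "edges": {}}


# Toy reference, roughly "The company is considering closing the plant."
reference = {
    "concept": "consider-02",
    "edges": {
        ":ARG0": leaf("company"),
        ":ARG1": {
            "concept": "close-01",
            "edges": {":ARG0": leaf("company"), ":ARG1": leaf("plant")},
        },
    },
}

# Perturbed graph, roughly "The company closes the plant."
perturbed = remove_wrapper(reference, "consider-02")
```

The function builds new dictionaries rather than editing in place, so the reference graph stays available as the positive example while the perturbed graph becomes the negative one. A real pipeline would also need an AMR parser and an AMR-to-text generator, which are outside this sketch.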

Paper Structure (36 sections, 1 equation, 6 figures, 9 tables)

Introduction
Abstract Meaning Representations
AMRFact
AMR-based Summary Perturbations
Predicate Error.
Entity Error.
Circumstance Error.
Discourse Link Error.
Out of Article Error.
Invalid Negative Data Filtering
Detecting Factual Inconsistency
Experimental Settings
Datasets
Training Dataset
Evaluation Dataset
...and 21 more sections

Figures (6)

Figure 1: Example of a reference (green) and a generated factually inconsistent summary (red) from the AMRFact dataset. Given a reference summary, we convert the text into an AMR graph (grey) and then remove "consider-02" to generate a factually inconsistent summary AMR graph (yellow). This perturbed summary strengthens the modality in the reference summary, resulting in factual inconsistency. The reference and perturbed summaries will be used as positive and negative examples, respectively.
Figure 2: Overview of AMRFact training phase: (1) The generation module first converts the reference summaries into AMR graphs. (2) These graphs are then manipulated to include common factual errors shown in current summarization systems, creating factually inconsistent AMR graphs. (3) These manipulated graphs are back-translated into text summaries, serving as negative examples for training a text-based factuality evaluator. (4) A selection module, using NLI score and BARTScore, filters out low-quality negative examples. (5) Finally, we fine-tune a RoBERTa-based model with this data to act as the evaluation metric, assessing factuality by comparing the original document (premise) with the summary (hypothesis) and measuring the probability of entailment.
Figure 3: Typology of factual errors. Given the source document and reference summary, we apply five kinds of factual inconsistencies: predicate error, entity error, circumstance error, discourse link error, and out-of-article error. Each color represents the implementation of one kind of factual error from reference summary to perturbed summary.
Figure 4: An illustration of how our invalid negative data filtering module works. Of the three examples shown, only the first perturbed summary is valid, since both its entailment score and its BARTScore satisfy the criteria described in the Invalid Negative Data Filtering section.
Figure 5: A breakdown of coherence scores for negative summaries produced by AMRFact and FactCC.
...and 1 more figure
