Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks

Huajian Zhang; Yumo Xu; Laura Perez-Beltrachini

TL;DR

This work tackles the challenge of evaluating summary faithfulness by critiquing fixed-premise NLI approaches and introducing InFusE, which uses variable premise sizes and sub-sentence reasoning to better capture entailment relations in document–summary pairs. It extends beyond single-document news by introducing DiverSumm, a long-form, multi-task benchmark with rich faithfulness annotations. InFusE employs incremental reasoning, reversed entailment, and sub-sentence decomposition to produce more accurate, interpretable faithfulness estimates across diverse summarisation tasks, outperforming strong baselines on ROC-AUC. The approach improves practical evaluation of abstractive summaries and offers a flexible framework for multi-task faithfulness assessment, with publicly available code and data supporting reproducibility and further research.

Abstract

We study existing approaches that leverage off-the-shelf Natural Language Inference (NLI) models for the evaluation of summary faithfulness and argue that these are sub-optimal due to the granularity level considered for premises and hypotheses. That is, the smallest content unit considered as a hypothesis is a sentence, and premises are made up of a fixed number of document sentences. We propose a novel approach, namely InFusE, that uses a variable premise size and simplifies summary sentences into shorter hypotheses. Departing from previous studies, which focus on single short document summarisation, we analyse NLI-based faithfulness evaluation for diverse summarisation tasks. We introduce DiverSumm, a new benchmark comprising long-form summarisation (long documents and summaries) and diverse summarisation tasks (e.g., meeting and multi-document summarisation). In experiments, InFusE obtains superior performance across the different summarisation tasks. Our code and data are available at https://github.com/HJZnlp/infuse.
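The idea of a variable premise size can be illustrated with a minimal sketch. This is not the authors' implementation: the scorer `toy_nli` is a hypothetical token-overlap stand-in for a real NLI model, and the stopping rule (stop adding document sentences once the neutral score no longer decreases) is an assumption made for illustration. Hypotheses are assumed to be already-simplified summary sentences.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: (premise, hypothesis) -> (entailment, neutral, contradiction).
Scorer = Callable[[str, str], Tuple[float, float, float]]


def toy_nli(premise: str, hypothesis: str) -> Tuple[float, float, float]:
    """Stand-in for a real NLI model (assumption, for illustration only).

    Entailment is the fraction of hypothesis tokens that occur in the premise.
    """
    premise_tokens = set(premise.lower().split())
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0, 1.0, 0.0
    ent = sum(tok in premise_tokens for tok in hyp_tokens) / len(hyp_tokens)
    return ent, 1.0 - ent, 0.0


def variable_premise_score(
    doc_sents: List[str], hypothesis: str, nli: Scorer = toy_nli
) -> Tuple[float, List[str]]:
    """Grow the premise one document sentence at a time.

    Sentences are ranked by their individual entailment score, then added
    while the neutral score keeps decreasing (assumed stopping rule).
    """
    ranked = sorted(doc_sents, key=lambda s: nli(s, hypothesis)[0], reverse=True)
    premise: List[str] = []
    best_ent, best_neu = 0.0, float("inf")
    for sent in ranked:
        candidate = premise + [sent]
        ent, neu, _ = nli(" ".join(candidate), hypothesis)
        if neu >= best_neu:  # no further reduction in uncertainty: stop
            break
        premise, best_ent, best_neu = candidate, ent, neu
    return best_ent, premise


def summary_score(
    doc_sents: List[str], hypotheses: List[str], nli: Scorer = toy_nli
) -> float:
    """Average the per-hypothesis entailment scores over a summary."""
    if not hypotheses:
        return 0.0
    scores = [variable_premise_score(doc_sents, h, nli)[0] for h in hypotheses]
    return sum(scores) / len(scores)


doc = ["the cat sat on the mat", "the dog barked loudly", "it rained all day"]
score, premise = variable_premise_score(doc, "the cat sat and the dog barked")
```

Here the hypothesis draws on two document sentences, so a premise fixed to a single sentence would leave part of it unsupported, while the incremental premise grows to cover both and then stops. In practice `toy_nli` would be replaced by a call to an actual NLI model.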

Paper Structure (29 sections, 9 figures, 10 tables, 1 algorithm)

Introduction
Faithfulness Annotated Data for Different Summarisation Tasks
DiverSumm: a New Benchmark
Error types
The Value of Adequate Premise and Hypothesis Granularity
InFusE
Incremental Reasoning
Reversed Reasoning
Sub-sentence Reasoning
Experimental Setup
Results
Faithfulness Evaluation
Performance on Different Error Types
Related work
Conclusions
...and 14 more sections

Figures (9)

Figure 1: Example of an input Document (D) and a Model-generated Summary Sentence (MSS) from the AggreFact benchmark (Tang et al., 2023) on the XSum dataset (Narayan et al., 2018). The example is considered unfaithful by the annotators. The Simplified Summary (SS) is the generated summary after automatic sentence splitting. The cyan-coloured text spans in the input document highlight the document content units that support the corresponding cyan spans in the summary. Red spans in the summary indicate content that is not supported by the input document. The $\models$ MSS and $\models$ SS$_i$ columns show entailment scores assigned by an off-the-shelf NLI model to document sentences acting as premises and either MSS or SS$_i$ sentences as hypotheses. The table at the bottom shows an example of an entailment relation from the MNLI dataset (Williams et al., 2018). Entailment scores are computed by the NLI model introduced in the Experimental Setup section and normalised for better reading.
Figure 2: Statistics for the number of fused document sentences (the pie charts) and their distances (the blue vertical bars) on XSum and CNNDM (AggreFact) and GovReport and ChemSum (DiverSumm).
Figure 3: Distribution of entailment scores on faithful summary sentences and unfaithful ones encompassing different error types for the ArXiv, GovReport and FRANK sets. The x-axis corresponds to the NLI-based approach: FullDoc in red, SeNtLI in green, InFusE in cyan, and InFusE$_\textsc{sub}$ in purple. The y-axis corresponds to the entailment scores (i.e., values ranging in [0,1]), and the z-axis corresponds to the count of instances.
Figure 4: Statistics for the number of fused document sentences (the pie charts) and their distances (the blue vertical bars) on QMSum, MultiNews, and ArXiv (DiverSumm).
Figure 5: Distribution of number of splits occurring in summary sentences.
...and 4 more figures
