Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks
Huajian Zhang, Yumo Xu, Laura Perez-Beltrachini
TL;DR
This work addresses the evaluation of summary faithfulness. It critiques NLI approaches that use a fixed premise size and introduces InFusE, which uses variable premise sizes and sub-sentence reasoning to better capture entailment relations in document–summary pairs. It extends evaluation beyond single-document news summarisation with DiverSumm, a long-form, multi-task benchmark with rich faithfulness annotations. InFusE employs incremental reasoning, reversed entailment, and sub-sentence decomposition to produce more accurate and interpretable faithfulness estimates across diverse summarisation tasks, outperforming strong baselines on ROC-AUC. The approach offers a flexible framework for multi-task faithfulness assessment of abstractive summaries, and the code and data are publicly available to support reproducibility and further research.
Abstract
We study existing approaches that leverage off-the-shelf Natural Language Inference (NLI) models for the evaluation of summary faithfulness and argue that they are sub-optimal because of the granularity at which premises and hypotheses are defined. That is, the smallest content unit considered as a hypothesis is a sentence, and premises are made up of a fixed number of document sentences. We propose a novel approach, InFusE, that uses a variable premise size and simplifies summary sentences into shorter hypotheses. Departing from previous studies, which focus on summarisation of single short documents, we analyse NLI-based faithfulness evaluation for diverse summarisation tasks. We introduce DiverSumm, a new benchmark comprising long-form summarisation (long documents and summaries) and diverse summarisation tasks (e.g., meeting and multi-document summarisation). In experiments, InFusE obtains superior performance across the different summarisation tasks. Our code and data are available at https://github.com/HJZnlp/infuse.
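To make the idea of a variable premise size concrete, the following is a minimal Python sketch and not the released InFusE implementation. It assumes an NLI function that returns (entailment, neutral, contradiction) probabilities. The names `toy_nli`, `variable_premise_score`, and `summary_score` are hypothetical, `toy_nli` is a word-overlap stand-in for a real NLI model, and the stopping rule (stop growing the premise once the neutral score no longer decreases) is an assumption made for illustration. Summary sentences are assumed to have already been split into shorter hypotheses.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (entailment, neutral, contradiction) probabilities for a premise/hypothesis pair
NLIScores = Tuple[float, float, float]
NLIFn = Callable[[str, str], NLIScores]


def toy_nli(premise: str, hypothesis: str) -> NLIScores:
    """Hypothetical stand-in for a real NLI model: a word-overlap heuristic."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    entail = len(p & h) / len(h) if h else 0.0
    return entail, 1.0 - entail, 0.0


def variable_premise_score(
    doc_sents: Sequence[str],
    hypothesis: str,
    nli: NLIFn = toy_nli,
    max_sents: Optional[int] = None,
) -> Tuple[float, List[str]]:
    """Grow the premise one document sentence at a time.

    Assumed stopping rule: stop as soon as adding the next-ranked sentence
    does not lower the neutral score. Returns the entailment score of the
    final premise together with the selected sentences.
    """
    if not doc_sents:
        return 0.0, []
    # Rank document sentences by their individual entailment score.
    ranked = sorted(doc_sents, key=lambda s: nli(s, hypothesis)[0], reverse=True)
    limit = len(ranked) if max_sents is None else min(max_sents, len(ranked))

    premise = [ranked[0]]
    best_entail, best_neutral, _ = nli(ranked[0], hypothesis)
    for sent in ranked[1:limit]:
        candidate = premise + [sent]
        entail, neutral, _ = nli(" ".join(candidate), hypothesis)
        if neutral >= best_neutral:
            break  # the extra sentence did not help; keep the smaller premise
        premise, best_entail, best_neutral = candidate, entail, neutral
    return best_entail, premise


def summary_score(
    doc_sents: Sequence[str],
    hypotheses: Sequence[str],
    nli: NLIFn = toy_nli,
) -> float:
    """Average the per-hypothesis scores (aggregation choice is an assumption)."""
    if not hypotheses:
        return 0.0
    scores = [variable_premise_score(doc_sents, h, nli)[0] for h in hypotheses]
    return sum(scores) / len(scores)
```

In practice `toy_nli` would be replaced by a wrapper around an off-the-shelf NLI model. The point of the sketch is only that the number of premise sentences is chosen per hypothesis rather than fixed in advance.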
