Discourse-Driven Evaluation: Unveiling Factual Inconsistency in Long Document Summarization

Yang Zhong, Diane Litman

TL;DR

The paper tackles factual inconsistency in long-document summarization by introducing StructScore, a discourse-informed evaluation framework that leverages RST-based discourse structure to both segment long texts into cohesive chunks and re-weight sentence-level NLI scores. By analyzing discourse features such as EDU density and nucleus-satellite relations, the approach identifies where errors are likeliest to occur and adjusts scoring accordingly. Empirical results across AggreFact-FtSOTA, DiverSumm, LongSciVerify, LongEval, and LegalSumm show that StructScore improves factuality detection over strong baselines and maintains competitiveness with large-language-model-based approaches, with notable gains on long, domain-diverse texts. The work also discusses computation costs, limitations of current discourse parsers, and avenues for generalization to other tasks, underscoring the value of incorporating discourse information into long-document evaluation and interpretability.

Abstract

Detecting factual inconsistency for long document summarization remains challenging, given the complex structure of the source article and long summary length. In this work, we study factual inconsistency errors and connect them with a line of discourse analysis. We find that errors are more common in complex sentences and are associated with several discourse features. We propose a framework that decomposes long texts into discourse-inspired chunks and utilizes discourse information to better aggregate sentence-level scores predicted by natural language inference models. Our approach shows improved performance on top of different model baselines over several evaluation benchmarks, covering rich domains of texts, focusing on long document summarization. This underscores the significance of incorporating discourse features in developing models for scoring summaries for long document factual inconsistency.
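
The aggregation idea in the abstract can be illustrated with a small sketch. It is not the paper's re-weighting formula: the `ScoredSentence` type, the `discourse_weight` function, and the `alpha` parameter are all hypothetical names introduced here, and the assumption that a sentence with more elementary discourse units (EDUs) should weigh more is only one plausible reading of the finding that errors are more common in complex sentences.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSentence:
    """A summary sentence with a sentence-level NLI score and an EDU count."""
    text: str
    nli_score: float  # assumed to lie in [0, 1]; higher means more consistent
    num_edus: int     # number of elementary discourse units (assumed given)


def discourse_weight(num_edus: int, alpha: float = 0.5) -> float:
    """Hypothetical weight that grows linearly with the EDU count."""
    return 1.0 + alpha * max(num_edus - 1, 0)


def aggregate(sentences: List[ScoredSentence], alpha: float = 0.5) -> float:
    """Weighted mean of sentence scores; alpha=0 reduces to the plain mean."""
    if not sentences:
        raise ValueError("need at least one scored sentence")
    weights = [discourse_weight(s.num_edus, alpha) for s in sentences]
    total = sum(w * s.nli_score for w, s in zip(weights, sentences))
    return total / sum(weights)


if __name__ == "__main__":
    summary = [
        ScoredSentence("A short, simple claim.", nli_score=0.9, num_edus=1),
        ScoredSentence("A long claim with several clauses.", nli_score=0.3, num_edus=3),
    ]
    print("plain mean:   ", aggregate(summary, alpha=0.0))
    print("weighted mean:", aggregate(summary, alpha=0.5))
```

With `alpha=0` every sentence counts equally. A positive `alpha` pulls the summary-level score toward the scores of the more complex sentences, so a low score on a multi-clause sentence lowers the total more than it would under a plain average.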

Paper Structure

This paper contains 62 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Our proposed approach to faithfulness inconsistency detection utilizes findings from discourse analysis. We first conduct discourse analysis on parsed summary sentences (§\ref{sec:summary_error_analysis}) and exploit the source document's discourse structure (§\ref{sec:document_structure}). Motivated by the findings, our proposed approach is introduced in §\ref{sec:source_segment} and §\ref{sec:reweight_algorithm}.
  • Figure 2: Average shortest path length per dataset for document and summary discourse trees. Datasets are sorted by average document length. Longer document-summary (DOC, SUMM) pairs have more branched document trees, and their summaries are also more complex. AG, DS, LSV, and LE refer to AggreFact-FtSOTA, DiverSumm, LongSciVerify, and LongEval, respectively.
  • Figure 3: RST tree for the example sentence. The salient units (promotion set) of each text span are shown above the horizontal line that represents the span. The example is taken from Louis et al. (2010).
  • Figure 4: The annotation interface for LegalSumm. The left panel displays the instructions and the content to be annotated. Annotators are then prompted to select one of four options, as shown in the right panel.
  • Figure 5: Example of segmentation failures. Left: the output of the chunking method used in AlignScore and MiniCheck. Right: the segments produced by our segmentation method.
  • ...and 1 more figure