
Stress Testing Factual Consistency Metrics for Long-Document Summarization

Zain Muhammad Mujahid, Dustin Wright, Isabelle Augenstein

TL;DR

This work addresses the challenge of factuality in long-document abstractive summarization, where legacy reference-free metrics struggle with length, dispersion, and cross-document evidence. It introduces a stress-testing protocol that applies seven meaning-preserving perturbations to summaries, together with a retrieval-based scoring framework, across three long-form domains. Six metrics are evaluated: BARTScore, SummaC-Conv, SummaC-ZS, AlignScore, UniEval, and MiniCheck. Retrieval scoring is defined by $\mathrm{score}(s_j) = \max_{k} M(s_j, d_{j,k}^{(w)})$ with context window size $w$. Key findings show substantial variability and domain-dependent weaknesses: retrieval context improves some metrics but does not guarantee factual alignment for information-dense claims (measured via $\mathrm{Sim}(s_j, D)$). The paper proposes directions including multi-span reasoning, context-aware calibration, perturbation-aware training, and hybrid evaluation signals, and provides code, perturbed data, and scripts at the linked repository for reproducibility.
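The retrieval scoring rule in the TL;DR takes a maximum over candidate contexts. The Python sketch below illustrates that max-over-contexts step only, under stated assumptions: it reads $d_{j,k}^{(w)}$ as the $k$-th run of $w$ consecutive source sentences, and it uses a toy token-overlap function in place of a real metric $M$. The names `token_overlap`, `windows`, and `retrieval_score` are hypothetical and are not taken from the paper or its repository.

```python
from typing import Callable, List, Sequence


def token_overlap(claim: str, context: str) -> float:
    """Toy stand-in for a metric M: fraction of claim tokens present in the context."""
    claim_tokens = claim.lower().split()
    if not claim_tokens:
        return 0.0
    context_tokens = set(context.lower().split())
    return sum(t in context_tokens for t in claim_tokens) / len(claim_tokens)


def windows(sentences: Sequence[str], w: int) -> List[str]:
    """Assumed context construction: every run of w consecutive source sentences."""
    if w < 1:
        raise ValueError("w must be at least 1")
    if len(sentences) <= w:
        # The whole source fits in a single window.
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + w]) for i in range(len(sentences) - w + 1)]


def retrieval_score(
    claim: str,
    sentences: Sequence[str],
    w: int,
    metric: Callable[[str, str], float] = token_overlap,
) -> float:
    """score(s_j) = max over k of M(s_j, d_{j,k}^{(w)})."""
    return max(metric(claim, ctx) for ctx in windows(sentences, w))


source = [
    "the cat sat on the mat",
    "a dog barked loudly",
    "the bird flew away",
]
claim = "the dog barked"
narrow = retrieval_score(claim, source, w=1)
wide = retrieval_score(claim, source, w=2)
```

In this toy case the wider window scores higher because the evidence for the claim is spread over two adjacent sentences. A real setup would replace `token_overlap` with one of the six evaluated metrics and `windows` with the paper's actual retrieval step.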

Abstract

Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We aim to see how robust summary factuality metrics are for long and multi-document setups by applying meaning-preserving perturbations and comparing metric scores before and after these edits.
  • Figure 2: Score change under factuality-preserving perturbations. Boxplots show the difference in factuality score between the perturbed and original summaries, for each metric and perturbation type, across three datasets. The central dot indicates the mean score difference, and the whiskers represent the minimum and maximum values.
  • Figure 3: Relationship between claim similarity and average factuality score. Higher similarity values correspond to more information-dense claims whose content overlaps with multiple parts of the source document. Metrics generally assign lower scores to these claims for LexAbSumm and SQuALITY, and higher scores for ScholarQABench, indicating reduced reliability for compressed information.
  • Figure 4: Prompt templates used with GPT-4o to generate meaning-preserving perturbations of the original summaries.
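The quantity plotted in Figure 2 is the difference in factuality score between each perturbed summary and its original, summarized by its mean, minimum, and maximum. A minimal Python sketch of that bookkeeping follows; the function names and the example scores are illustrative assumptions, not values or code from the paper.

```python
from statistics import mean
from typing import Dict, List, Sequence


def score_deltas(original: Sequence[float], perturbed: Sequence[float]) -> List[float]:
    """Per-summary change in metric score: perturbed minus original."""
    if len(original) != len(perturbed):
        raise ValueError("score lists must be paired one-to-one")
    return [p - o for o, p in zip(original, perturbed)]


def summarize(deltas: Sequence[float]) -> Dict[str, float]:
    """Mean (the central dot) and min/max (the whiskers) of the score differences."""
    return {"mean": mean(deltas), "min": min(deltas), "max": max(deltas)}


# Made-up scores for three summaries under one metric and one perturbation type.
original_scores = [0.75, 0.5, 1.0]
perturbed_scores = [0.5, 0.5, 0.25]
deltas = score_deltas(original_scores, perturbed_scores)
stats = summarize(deltas)
```

Because the perturbations are meant to preserve factuality, a metric that is robust to them would be expected to produce deltas concentrated near zero.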