LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo

TL;DR

This paper tackles the challenge of evaluating the faithfulness of long-form summaries, where human judgments are costly and evaluation practices are inconsistent. It introduces LongEval, a set of guidelines advocating fine-grained (unit-level) evaluation, partial annotation to reduce workload, and optional source-alignment hints, and validates them on SQuALITY and PubMed. Key findings are that fine-grained judgments substantially reduce inter-annotator variance, that partial annotation correlates highly with full annotation at lower cost, and that automated highlighting hints have limited utility, especially in non-extractive cases. The authors also release their annotated data, annotation templates, and a Python library to support reproducible long-form faithfulness evaluation. Together, these contributions offer a practical, standardized pathway for assessing faithfulness in long-form summarization and for informing future automatic metrics.

Abstract

While human evaluation remains best practice for accurately judging the faithfulness of automatically generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall's tau using 50% judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
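
The two core ideas in the abstract, fine-grained scoring and partial annotation, can be sketched as follows. This is a minimal illustration and not the released LongEval library: the function names (`faithfulness_score`, `partial_score`, `kendall_tau`), the binary per-unit labels, the uniform random sampling of units, and the synthetic judgments are all assumptions made for the example.

```python
import random
from itertools import combinations


def faithfulness_score(unit_labels):
    """Summary-level score: percentage of units judged faithful (1) rather than not (0)."""
    return 100.0 * sum(unit_labels) / len(unit_labels)


def partial_score(unit_labels, fraction, rng):
    """Score a summary from a uniformly sampled subset of its units (assumed scheme)."""
    k = max(1, round(fraction * len(unit_labels)))
    return faithfulness_score(rng.sample(unit_labels, k))


def kendall_tau(xs, ys):
    """Kendall's tau-b over all pairs; returns 0.0 if either variable is constant."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from tau-b
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


# Synthetic judgments (not from the paper): 20 summaries with 30 units each,
# where every summary has its own underlying rate of faithful units.
rng = random.Random(0)
summaries = []
for _ in range(20):
    rate = rng.uniform(0.3, 0.95)
    summaries.append([1 if rng.random() < rate else 0 for _ in range(30)])

full = [faithfulness_score(s) for s in summaries]
half = [partial_score(s, 0.5, rng) for s in summaries]
print(f"Kendall's tau, 50% vs. full annotation: {kendall_tau(full, half):.2f}")
```

Sweeping `fraction` over several values and repeating the sampling many times would give a curve comparable in spirit to the partial-annotation analysis, although the numbers from synthetic data say nothing about the paper's results.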

Paper Structure (20 sections, 1 equation, 9 figures, 11 tables, 1 algorithm)

Figures (9)

  • Figure 1: Overview of research questions considered in LongEval. Example summary taken from SQuALITY.
  • Figure 2: 95% confidence intervals of Pearson correlations between various automatic evaluation metrics and human evaluation data collected with fine (blue) and coarse (orange) annotation methods. In both datasets, fine annotations lead to much narrower CIs than coarse annotations. See \ref{sec:metric-correlations-kt} for the plot with Kendall's tau.
  • Figure 3: 95% confidence intervals of estimated model performances using fine (blue) and coarse (orange) annotation methods. Intervals are calculated using bootstrap resampling across annotators (\ref{sec:bootstrap}). While both annotation granularities lead to a similar relative ordering of systems, fine annotations have narrower confidence intervals. The higher LongT5 score vs. human in PubMed is due to highly extractive LongT5 summaries (\ref{sec:longeval}).
  • Figure 4: Accuracy and variance after annotating a fraction of units per summary (X-axis) with fine annotation. Despite annotating just a fraction of the summary, we observe a high segment-level Kendall's tau correlation with a full annotation (left). However, we observe higher inter-annotator variance as the fraction decreases (right). Confidence intervals shown are 95% and computed across 1000 random subsets (see \ref{appendix:partial-pearson} for the left plot with Pearson).
  • Figure 5: Learning effect over time while evaluating long-form summaries with fine annotation. As the annotators evaluate more summary units, they learn the document better and are much faster at annotation irrespective of whether hints are shown to them.
  • ...and 4 more figures
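
The caption of Figure 3 mentions confidence intervals obtained by bootstrap resampling across annotators. A minimal sketch of a percentile bootstrap is shown below. The exact procedure used in the paper is not given in this section, so the percentile method, the function name `bootstrap_ci`, and the per-annotator scores are assumptions for illustration only.

```python
import random
from statistics import mean


def bootstrap_ci(annotator_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score, resampling annotators with replacement."""
    rng = random.Random(seed)
    n = len(annotator_scores)
    means = sorted(
        mean(rng.choices(annotator_scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-annotator scores for one system (made up, not from the paper):
# the coarse scores disagree more across annotators than the fine scores do.
coarse = [55.0, 90.0, 70.0, 40.0, 80.0]
fine = [68.0, 74.0, 71.0, 66.0, 73.0]

coarse_lo, coarse_hi = bootstrap_ci(coarse)
fine_lo, fine_hi = bootstrap_ci(fine)
print(f"coarse 95% CI width: {coarse_hi - coarse_lo:.1f}")
print(f"fine   95% CI width: {fine_hi - fine_lo:.1f}")
```

Because the interval width tracks the spread of scores across annotators, lower inter-annotator variance translates directly into a narrower interval, which is the pattern the caption describes for fine annotations.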