STORYSUMM: Evaluating Faithfulness in Story Summarization

Melanie Subbiah, Faisal Ladhak, Akankshya Mishra, Griffin Adams, Lydia B. Chilton, Kathleen McKeown

TL;DR

StorySumm is a narrative-faithfulness benchmark of 96 pairs of short stories and LLM-generated summaries, with 500+ sentence-level labels, for evaluating how well the summaries reflect their source stories. The authors show that no single human annotation protocol catches every inconsistency and that current automatic metrics underperform, with the best system reaching about two-thirds balanced accuracy. They advocate combining multiple ground-truth protocols and using an expanded set of gold labels to better calibrate evaluation and reduce label inflation. The work gives concrete recommendations for evaluating narrative summaries and highlights the need for better automatic methods to detect subtle story-level inconsistencies.

Abstract

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
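The abstract reports metric performance as balanced accuracy. As a minimal sketch, assuming binary labels (1 = faithful, 0 = unfaithful) and the standard definition of balanced accuracy as the mean of per-class recall, the function below shows why this measure is stricter than plain accuracy when most summaries are faithful. The function name and the example labels are hypothetical and are not taken from the paper or its dataset.

```python
def balanced_accuracy(gold, pred):
    """Mean of recall on the faithful class (1) and the unfaithful class (0)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pos = sum(1 for g in gold if g == 1)   # gold faithful items
    neg = len(gold) - pos                  # gold unfaithful items
    if pos == 0 or neg == 0:
        raise ValueError("both classes must appear in the gold labels")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical labels: 8 faithful summaries, 2 unfaithful ones.
gold = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A metric that calls everything faithful is right on 8 of 10 items,
# but it never detects an unfaithful summary.
always_faithful = [1] * 10
plain_accuracy = sum(g == p for g, p in zip(gold, always_faithful)) / len(gold)
score_trivial = balanced_accuracy(gold, always_faithful)

# A metric that catches one of the two unfaithful summaries
# at the cost of two false alarms.
mixed = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
score_mixed = balanced_accuracy(gold, mixed)
```

Under these assumed labels, the always-faithful predictor has high plain accuracy but only chance-level balanced accuracy, which is why a balanced measure suits a benchmark built to test detection of rare, subtle inconsistencies.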

Paper Structure (26 sections, 10 figures, 12 tables)

Figures (10)

  • Figure 1: A StorySumm example illustrating an incorrect interpretation of double entendre. A standard fine-grained human annotation protocol missed this inconsistency even though it is obvious once pointed out.
  • Figure 2: An example of inconsistencies generated by the hybrid method, all of which are incorrect in this case. #2 and #3 are details that are consistent between the summary and story. #1 convinces annotators, but is actually consistent with the story.
  • Figure 3: Confusion matrices of the expert and hybrid labels against the annotator labels.
  • Figure 4: Confusion matrices of label overlap between the three human annotation methods and the expanded gold set of labels.
  • Figure 5: Streamlit instructions for the annotator labels. Other methods have slight variations on these instructions based on their format.
  • ...and 5 more figures