Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

Melanie Subbiah, Sean Zhang, Lydia B. Chilton, Kathleen McKeown

TL;DR

This study evaluates GPT-4, Claude-2.1, and Llama-2-70B on unseen, unpublished short stories (≤9900 tokens) by engaging professional writers to assess LLM-generated summaries at the span, summary, and story levels using criteria grounded in narrative theory. It finds that while top models can produce high-quality summaries, faithfulness and subtext interpretation remain significant challenges: all three models make faithfulness mistakes in over 50% of summaries, and their ratings align poorly with human judgments when the models are used as evaluators. The work also shows substantial variability with writing style (unreliable narrators, nonlinear timelines) and demonstrates that automatic metrics do not reliably predict the writers' ratings. Overall, the paper argues for human-in-the-loop evaluation and collaboration with domain experts to robustly assess narrative understanding and to avoid training-data leakage, offering a practical methodology for future research in long-form narrative AI.

Abstract

We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

Paper Structure (25 sections, 11 figures, 7 tables)

Figures (11)

  • Figure 1: The two different methods we use for summarization and the associated prompts for the models. GPT-4 and Claude have sufficient input context to summarize a whole story, whereas Llama has to use a chunk-then-summarize approach for longer stories.
  • Figure 2: Interface screenshots showing the questions writers are asked to evaluate the summaries of their stories using a 4-point Likert scale.
  • Figure 3: Examples of openings from stories scored at different reading levels by the Flesch-Kincaid score.
  • Figure 4: Distribution of Likert score ratings for each model's summaries by attribute.
  • Figure 5: Examples of some of the best analysis-focused sentences from GPT-4 and Claude summaries.
  • ...and 6 more figures