Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
Melanie Subbiah, Sean Zhang, Lydia B. Chilton, Kathleen McKeown
TL;DR
This study evaluates GPT-4, Claude-2.1, and Llama-2-70B on summarizing unpublished short stories (≤9900 tokens) that the models cannot have seen, with professional writers assessing the LLM-generated summaries at the span, summary, and story levels using criteria grounded in narrative theory. It finds that while the top models can produce high-quality summaries, faithfulness and subtext interpretation remain significant challenges: all three models make faithfulness errors in over 50% of summaries, and their ratings align poorly with human judgments when the models are used as evaluators. The work also shows that summary quality varies substantially with writing style (unreliable narrators, nonlinear timelines) and that automatic metrics do not reliably predict the writers' ratings. Overall, the paper argues for human-in-the-loop evaluation and collaboration with domain experts, both to assess narrative understanding robustly and to avoid training-data leakage, and it offers a practical methodology for future research on long-form narrative understanding.
Abstract
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.
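
To make the last claim concrete, the following is a minimal sketch of how one might check whether an automatic metric tracks writer ratings, using Spearman rank correlation. The choice of Spearman is an assumption for illustration; the abstract does not say which correlation statistic the authors used. The function names and the example scores are hypothetical placeholders, not data or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical placeholder scores for five summaries (NOT from the paper).
writer_ratings = [4, 2, 3, 1, 4]                 # e.g. 1-4 quality ratings
metric_scores = [0.61, 0.58, 0.40, 0.55, 0.47]   # e.g. an automatic metric
rho = spearman(writer_ratings, metric_scores)
print(f"Spearman rho between writer ratings and metric: {rho:.2f}")
```

A value of rho near 0 would indicate that the metric's ranking of summaries says little about how the writers ranked them, which is the kind of mismatch the abstract reports. Rank correlation is used in this sketch because writer ratings are ordinal and often tied.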
