Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie

TL;DR

We present ConStory-Bench, a benchmark for evaluating narrative consistency in long-form story generation, and ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Together they can inform future efforts to improve consistency in long-form narrative generation.

Abstract

What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.

Paper Structure (49 sections, 9 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: Overview of ConStory-Bench. The framework comprises three components: (a) a 2,000-prompt benchmark for long story generation (targeting 8,000–10,000 words), (b) ConStory-Checker, a three-stage pipeline that extracts errors across five categories, pairs contradictions, and constructs evidence chains, and (c) standardized scoring via Consistency Error Density (CED) and Group Relative Rank (GRR).
  • Figure 2: Representative consistency error examples sampled from real LLM-generated stories on ConStory-Bench. Highlighted segments show contradictions in Timeline & Plot Logic, Characterization, World-building & Setting, Factual & Detail Consistency, and Narrative & Style.
  • Figure 3: Output length distribution across representative models. Stacked bars show the proportion of 0–3K, 3K–6K, and 6K+ word outputs.
  • Figure 4: Consistency error growth across different story lengths for two models. Lines: average error count per story at each length bin (cf. the "Errors" column of the comprehensive performance table); bars: number of samples in each bin.
  • Figure 5: Correlation matrix of error categories across all model outputs. Higher values (darker blue) indicate stronger co-occurrence of error types.
  • ...and 12 more figures