Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?

Karin de Langis, Püren Öncel, Ryan Peters, Andrew Elfenbein, Laura Kristen Allen, Andreas Schramm, Dongyeop Kang

TL;DR

This study probes whether LLMs can reliably detect narrative incoherence by comparing internal signals (perplexity and hidden-state probes) with external judgments (coherence/quality ratings and True/False tasks) across six models and prompt variations. It uses a paired narrative dataset with two incoherence types (event-setting and trait-behavior) augmented from 18 to 25 story pairs to enable precise comparisons. The results show that LLMs’ internal representations distinguish incoherence at the manipulation point, but this sensitivity largely dissolves by the story end, and explicit judgments are only weakly informative; reasoning prompts yield only partial metacognitive insight. The findings reveal a dissociation between internal coherence monitoring and external reporting, highlighting a gap in robust narrative comprehension and bearing on trustworthiness and educational uses of LLMs; the dataset provides a resource for future work on narrative understanding in language models.

Abstract

Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs' internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLMs' understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between a model's internal state and its behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than on building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs do not have a complete grasp of narrative coherence.

Paper Structure

This paper contains 30 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: We use paired coherent and incoherent narratives to investigate the extent to which LLMs maintain coherence when tracking fictional entities. Internal measures differentiate coherent from incoherent stories at the incoherent location, yet LLMs' explicit coherence ratings at the end of the story fail to do so.
  • Figure 2: An exemplary incoherent-coherent story pair. Both versions share an identical introduction and conclusion, but differ in the middle situation: the incoherent version includes the inconsistent situation. Note the boldface target sentence, which conflicts with the inconsistent situation only. Some introductory text is omitted for brevity, denoted by (...).
  • Figure 3: We identify and annotate two types of incoherence in the dataset: event-setting and trait-behavior. They differ in the target of the violation; specifically, trait-behavior is incoherent because it violates a strongly established character trait, while event-setting is incoherent because it violates some facet of the established narrative scene (e.g., the geographical location or the social norms).
  • Figure 4: Mean accuracies across 10-fold cross-validation for probing hidden state representations to identify incoherent narratives. We probe both at the end of the target sentence that contains the incoherence in the incoherent story version (target), and at the conclusion of the story (end). The $x$-axis denotes model layer. All models show strong separation at the target location, but by the story's end, separation is notably weaker, with smaller models in particular near chance ($\approx 50\text{--}60\%$ accuracy). Llama3.1-70B has the best performance at the story's end, and it also demonstrates the best understanding of coherence in responses to rating questions.
  • Figure 5: LLMs assign higher perplexity (i.e., lower likelihood) to incoherent events relative to coherent ones.
  • ...and 3 more figures
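
As a rough illustration of the two internal measures named in Figures 4 and 5, the sketch below computes perplexity from per-token log-probabilities and estimates k-fold probing accuracy on feature vectors. It is not the authors' code: the function names, the toy inputs, and the nearest-centroid classifier (a simple stand-in for whatever probe the paper trains) are assumptions made for illustration. In practice the features would be hidden states extracted from one model layer at the target sentence or at the story's end.

```python
import math
import random


def perplexity(token_logprobs):
    """Perplexity of a text span, given natural-log probabilities per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kfold_probe_accuracy(features, labels, k=10, seed=0):
    """Mean held-out accuracy of a nearest-centroid probe over k folds.

    features: list of equal-length float vectors (hypothetical hidden states).
    labels:   list of 0 (coherent) / 1 (incoherent), same length as features.
    Assumes every training split contains both classes.
    """
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)
    # Round-robin split of the shuffled indices into k folds; drop empty ones.
    folds = [f for f in (idx[i::k] for i in range(k)) if f]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit: one mean vector (centroid) per class, from training rows only.
        centroids = {}
        for c in (0, 1):
            rows = [features[i] for i in train if labels[i] == c]
            centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        # Predict: assign each held-out vector to the closer centroid.
        correct = 0
        for i in held_out:
            dist = {
                c: sum((a - b) ** 2 for a, b in zip(features[i], centroids[c]))
                for c in (0, 1)
            }
            correct += min(dist, key=dist.get) == labels[i]
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Toy inputs (made up): the "incoherent" target sentence gets lower
# token probabilities, hence higher perplexity.
coherent_lp = [math.log(0.5), math.log(0.5)]
incoherent_lp = [math.log(0.25), math.log(0.25)]
print(perplexity(coherent_lp) < perplexity(incoherent_lp))

# Toy, clearly separable "hidden states" for 20 coherent and 20 incoherent stories.
toy_features = [[0.1 * i, 0.0] for i in range(20)] + [[10.0 + 0.1 * i, 1.0] for i in range(20)]
toy_labels = [0] * 20 + [1] * 20
print(kfold_probe_accuracy(toy_features, toy_labels, k=10))
```

Comparing perplexity within a story pair, rather than across unrelated stories, mirrors the paired design described above: the two versions share their introduction and conclusion, so a difference at the target sentence can be attributed to the manipulated situation.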