
Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga

Abstract

We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a unified narrative coherence score. We find that VLMs show coherence profiles that are broadly similar to one another but differ systematically from those of humans. Moreover, differences on individual measures are often subtle, but they become clearer when the measures are considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from human narratives in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.

Figures (4)

  • Figure 1: Example visual story sequence with numbered story and character images. Four short excerpts are shown from human and GPT-4o outputs under the short and long prompt conditions. Both humans and models have access to the full sequence of images at once. The [SEP] marker indicates segment boundaries. Each segment is a chunk of text about a single image, in the same left-to-right order as the numbered images. Segments are used as the unit of analysis in several of our metrics.
  • Figure 2: Implicit discourse relation type composition, short prompt. Each bar shows the mean within-story proportion of predicted implicit relation types, averaged across stories (and displayed as 100% stacked bars).
  • Figure 3: Topic switch under topic space compression. The highlighted region ($\mathbf{nr\_topics}=15$ to $5$) marks a pronounced drop in human topic switch.
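To make the segment-level statistics behind Figures 1 and 2 concrete, the following is a minimal Python sketch of how a [SEP]-delimited story could be split into image-aligned segments and how within-story proportions of predicted implicit relation types could be averaged across stories. The function names and toy relation labels are illustrative assumptions and are not taken from the paper's released code.

```python
from collections import Counter


def split_segments(story: str) -> list[str]:
    """Split a story into image-aligned segments at the [SEP] marker (cf. Figure 1)."""
    return [seg.strip() for seg in story.split("[SEP]") if seg.strip()]


def mean_relation_proportions(stories_relations: list[list[str]]) -> dict[str, float]:
    """Mean within-story proportion of each predicted implicit relation type,
    averaged across stories (cf. Figure 2)."""
    per_story = []
    for relations in stories_relations:
        counts = Counter(relations)
        total = sum(counts.values())
        per_story.append({rel: n / total for rel, n in counts.items()})
    labels = {rel for props in per_story for rel in props}
    return {
        rel: sum(props.get(rel, 0.0) for props in per_story) / len(per_story)
        for rel in labels
    }


# Toy example: relation labels per story would come from an external
# implicit discourse relation classifier, which is not part of this sketch.
stories = [["Expansion", "Contingency", "Expansion"], ["Comparison", "Expansion"]]
print(split_segments("She opens the door. [SEP] The dog runs out."))
print(mean_relation_proportions(stories))
```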