Generating Visual Stories with Grounded and Coreferent Characters

Danyang Liu, Mirella Lapata, Frank Keller

TL;DR

The paper addresses generic, poorly grounded narratives in visual storytelling by introducing character-centric story generation. It builds VIST++, an automated augmentation of VIST with visual and textual character coreference chains and multimodal alignment, and trains a generation model on this data (SOtter++) that enforces grounding through explicit character mentions linked to visual segments. The approach relies on a novel LVLM-based coreference pipeline and an LLM-as-Judge evaluation. Results show that textual coreference improves character richness and that combining visual and textual coreference yields the strongest coreference results. The model also achieves stronger character grounding and generalizes beyond VIST to the Visual Writing Prompts dataset, suggesting practical impact for more engaging and coherent visual narratives.

Abstract

Characters are important in narratives. They move the plot forward, create emotional connections, and embody the story's themes. Visual storytelling methods focus more on the plot and events relating to it, without building the narrative around specific characters. As a result, the generated stories feel generic, with character mentions being absent, vague, or incorrect. To mitigate these issues, we introduce the new task of character-centric story generation and present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions. Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark. Specifically, we develop an automated pipeline to enrich VIST with visual and textual character coreference chains. We also propose new evaluation metrics to measure the richness of characters and coreference in stories. Experimental results show that our model generates stories with recurring characters which are consistent and coreferent to a larger extent compared to baselines and state-of-the-art systems.

Paper Structure (43 sections, 2 equations, 4 figures, 14 tables, 1 algorithm)

This paper contains 43 sections, 2 equations, 4 figures, 14 tables, 1 algorithm.

Figures (4)

  • Figure 1: Examples from the VIST dataset illustrating how the absence of characters or vague character references affect story coherence and engagement. The first story is purely descriptive without any characters, lacking emotional depth and narrative engagement. No human presence means no perspective, making it static and impersonal. In the second story, while a character is mentioned ("a man"), he adds nothing to the story. The man is passive, disconnected from events, and does not drive the narrative, making the story feel just as flat as the first one. The third story fails to refer to the protagonist correctly, switching between "we", "they", and "I", which causes the story to be confusing and illogical.
  • Figure 2: A sample from the VIST dataset (Huang et al., 2016) augmented with character chains. Visual characters are outlined by segmentation mask boundaries, where the same color indicates the same character. Each character bears a unique label overlaid at the center of its mask. Output stories are annotated with textual character chains, and each character mention is aligned to a visual character (e.g., (#4) refers to visual segment 4).
  • Figure 3: Illustration of the incremental clustering algorithm for creating visual coreference chains and prompts used in textual coreference resolution. (a) Character detections in Image$_{k+1}$ are compared against character clusters from the previous $k$ images to generate a visual similarity matrix. Best matching character detections are added to existing clusters. Detections that do not match any clusters are grouped into new clusters. (b) A QA-based prompting method is used to identify entities referring to characters (Step 1). Then character clusters are identified using a structured prompt template, which can handle singular and plural character mentions (Step 2).
  • Figure 4: Examples of system output and human-written story for an image sequence (VIST test set).
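
The incremental clustering described in the Figure 3(a) caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cosine similarity, the centroid representation of a cluster, the greedy one-to-one matching, the `threshold` value, and all function names are assumptions made here for the sake of a runnable example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(members):
    """Mean feature vector of a cluster (assumed cluster representation)."""
    n = len(members)
    return [sum(col) / n for col in zip(*members)]

def incremental_cluster(images, threshold=0.8):
    """Assign a cluster id to every character detection, image by image.

    images: list of images, each a list of detection feature vectors.
    threshold: hypothetical minimum similarity for joining an existing cluster.
    Returns one list of cluster ids per image.
    """
    clusters = []  # clusters[c] holds the feature vectors assigned to cluster c
    labels = []
    for detections in images:
        image_labels = [None] * len(detections)
        # Similarity matrix: detections of this image vs. clusters built so far.
        centroids = [centroid(members) for members in clusters]
        pairs = sorted(
            ((cosine(d, c), i, j)
             for i, d in enumerate(detections)
             for j, c in enumerate(centroids)),
            reverse=True,
        )
        used_detections, used_clusters = set(), set()
        # Greedy matching: best pairs first, each cluster used once per image.
        for sim, i, j in pairs:
            if sim < threshold:
                break
            if i in used_detections or j in used_clusters:
                continue
            image_labels[i] = j
            used_detections.add(i)
            used_clusters.add(j)
        # Matched detections join their cluster; unmatched ones start new clusters.
        for i, d in enumerate(detections):
            if image_labels[i] is None:
                clusters.append([d])
                image_labels[i] = len(clusters) - 1
            else:
                clusters[image_labels[i]].append(d)
        labels.append(image_labels)
    return labels

example = [
    [[1.0, 0.0], [0.0, 1.0]],    # image 1: two new characters
    [[0.9, 0.1], [-1.0, 0.0]],   # image 2: one returning, one new
    [[0.0, 1.0]],                # image 3: second character returns
]
chains = incremental_cluster(example)
```

In this toy run, detections that share a cluster id across images form one visual coreference chain. The one-to-one constraint encodes the assumption that a character appears at most once per image.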