StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

Daniel Oliveira, David Martins de Matos

TL;DR

This work introduces StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching, and fine-tunes Qwen Storyteller3 on it, building on prior work in visual grounding and entity re-identification.

Abstract

Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
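
The alignment step described in the abstract (matching screenplay dialogue to subtitle text so that speaker names from the script can be attached to subtitle timestamps) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes LCS stands for longest common subsequence, operates on word tokens, and assigns each subtitle the speaker whose script words match it most often. The function names `tokenize`, `lcs_pairs`, and `attribute_speakers`, as well as the input layout, are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover the matched positions.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def attribute_speakers(script, subtitles):
    """Attach a speaker to each subtitle.

    script:    list of (speaker, dialogue_line) in screenplay order
    subtitles: list of (start_sec, end_sec, text) in playback order
    Returns a list of (start_sec, end_sec, speaker_or_None, text).
    """
    script_tokens, token_speaker = [], []
    for speaker, line in script:
        for tok in tokenize(line):
            script_tokens.append(tok)
            token_speaker.append(speaker)

    sub_tokens, token_sub = [], []
    for idx, (_, _, text) in enumerate(subtitles):
        for tok in tokenize(text):
            sub_tokens.append(tok)
            token_sub.append(idx)

    # Each matched token votes for its script speaker on the subtitle it falls in.
    votes = [Counter() for _ in subtitles]
    for i, j in lcs_pairs(script_tokens, sub_tokens):
        votes[token_sub[j]][token_speaker[i]] += 1

    return [
        (start, end, v.most_common(1)[0][0] if v else None, text)
        for (start, end, text), v in zip(subtitles, votes)
    ]


# Invented toy input, not taken from the dataset.
script = [("ALICE", "Where are you taking me?"), ("BOB", "To the opera, where else?")]
subtitles = [
    (12.0, 13.5, "Where are you taking me?"),
    (13.8, 15.2, "To the opera."),
    (20.0, 21.0, "[music playing]"),
]
for row in attribute_speakers(script, subtitles):
    print(row)
```

On this toy input the first subtitle is attributed to ALICE, the second to BOB, and the third, which has no counterpart in the script, is left without a speaker. The full dynamic-programming table needs memory proportional to the product of the two token counts, so a feature-length script would likely require a windowed or chunked variant.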

Paper Structure (22 sections, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Example story from the StoryMovie dataset showing entity grounding with script-aligned dialogue. Character names from the screenplay (e.g., Mr. Johnny, Loretta) are linked to visual entities, and dialogue is attributed based on script-subtitle alignment. Loretta is re-identified across frames, appearing as char4 in images 1 and 3 and as char2 in images 2, 4, and 5.