The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective

Andrew Shin, Yusuke Mori, Kunitake Kaneko

TL;DR

The paper investigates storytelling as a critical dimension of text-to-video generation, arguing that current models primarily render single scenes rather than coherent narratives. It analyzes three prompt modalities (short stories, scripts, and captions) to build an empirical understanding of storytelling capabilities and limitations, and introduces the T2Vid2T evaluation framework, which cycles video captions back to text prompts. Key contributions include a structured evaluation of storytelling components (character, setting, plot), a three-prompt generation pipeline, and insights into where current methods fall short. The findings reveal a gap between high-fidelity visuals and narrative coherence, and point to directions such as reference-based conditioning and global story representations for advancing storytelling-aware video synthesis and metric standardization.

Abstract

Text-to-video generation has witnessed notable progress, with generated outputs reflecting the text prompts with high fidelity and impressive visual quality. However, current text-to-video generation models are invariably focused on conveying the visual elements of a single scene, and have so far been indifferent to another important potential of the medium, namely storytelling. In this paper, we examine text-to-video generation from a storytelling perspective, which has hardly been investigated, and make empirical remarks that spotlight the limitations of the current text-to-video generation scheme. We also propose an evaluation framework for the storytelling aspects of videos and discuss potential future directions.

Paper Structure (16 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Scenes from the video "The Lost Melody" generated from a short story.
  • Figure 2: Overall workflow of generating videos from a short story generated by a large language model.
  • Figure 3: Unless each prompt contains sufficient detail, the model generates incoherent results.
  • Figure 4: Evaluation workflow for T2Vid2T.
  • Figure 5: Example of a human-written summary that is substantially different from the input.
  • ...and 2 more figures