VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim

TL;DR

VideoGen-of-Thought (VGoT) introduces a training-free, modular pipeline for generating cohesive multi-shot videos from a single sentence. It combines dynamic storyline modeling with self-validation, identity-aware cross-shot propagation using IPP tokens, and adjacent latent transition mechanisms to ensure narrative coherence and visual consistency across shots. Quantitative and human evaluations show substantial gains in within-shot and cross-shot face and style consistency compared to baselines, with significantly reduced manual intervention. The work also provides a new multi-shot evaluation protocol and a 10-story benchmark to assess long-form narrative video generation.

Abstract

Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which turns the user prompt into concise shot drafts and then expands them into detailed specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, and HDR lighting) with self-validation to ensure logical progression. (2) Visual inconsistency: Previous approaches struggle to maintain consistent appearance across shots. Our identity-aware cross-shot propagation builds identity-preserving portrait (IPP) tokens that keep character identity while allowing controlled trait changes (expressions, aging) required by the story. (3) Transition artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while requiring 10x fewer manual adjustments. VGoT bridges the gap between raw visual synthesis and director-level storytelling for automated multi-shot video generation.
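To make the draft-then-expand step with self-validation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a shot specification is a record with one text field per domain and that "self-validation" means rejecting and regenerating any specification with an empty domain. The names `ShotSpec`, `missing_domains`, `expand_with_validation`, and `toy_expand` are hypothetical, and `toy_expand` stands in for whatever language model the real pipeline calls.

```python
from dataclasses import dataclass, fields
from typing import Callable, List


@dataclass
class ShotSpec:
    """Hypothetical record: one text field per domain named in the abstract."""
    character: str
    background: str
    relation: str
    camera: str
    hdr: str


def missing_domains(spec: ShotSpec) -> List[str]:
    """Return the names of domains whose description is empty or whitespace."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


def expand_with_validation(
    draft: str,
    expand: Callable[[str, int], ShotSpec],
    max_retries: int = 3,
) -> ShotSpec:
    """Expand a short shot draft, regenerating until every domain is filled."""
    for attempt in range(max_retries):
        spec = expand(draft, attempt)
        if not missing_domains(spec):
            return spec
    raise ValueError(f"no complete specification after {max_retries} attempts")


def toy_expand(draft: str, attempt: int) -> ShotSpec:
    """Stand-in generator (assumption): the first attempt omits the camera field."""
    camera = "" if attempt == 0 else "slow pan"
    return ShotSpec(
        character=f"{draft}: protagonist",
        background="study room",
        relation="alone",
        camera=camera,
        hdr="warm light",
    )


spec = expand_with_validation("a boy reads", toy_expand)
print(spec.camera)
```

In this toy run the first expansion is rejected because the camera domain is empty, and the second attempt is accepted. A real validator would presumably also check logical consistency between consecutive shots, which this sketch does not attempt.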

Paper Structure

This paper contains 26 sections, 24 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Illustration of VideoGen-of-Thought (VGoT). (a) Comparison of existing methods with VGoT in multi-shot video generation. Existing methods struggle to maintain plausibility and consistency across multiple shots, while VGoT addresses these challenges through a multi-shot generation approach. (b) Challenges solved by VGoT: addressing narrative fragmentation with dynamic storyline modeling across five domains (characters/backgrounds/relations/camera/HDR), tackling visual inconsistency via identity-aware cross-shot propagation to create keyframes using IPP tokens derived from narrative elements, and resolving transition artifacts during multi-shot video synthesis through adjacent latent transition mechanisms.
  • Figure 2: Flowchart of VideoGen-of-Thought. Left: Shot descriptions are generated from user prompts, describing attributes such as character details, background, relations, and camera pose. Pre-shot descriptions provide a broader context for the upcoming scenes. Middle Top: Keyframes are generated using a text-to-image diffusion model conditioned on identity-preserving (IP) embeddings, which ensures consistent representation of characters throughout the shots. IP portraits help maintain visual identity consistency. Right: The shot-level video clips are generated from keyframes, followed by shot-by-shot transition inference to ensure temporal consistency across different shots. This collaborative framework ultimately produces a cohesive narrative-driven video.
  • Figure 3: Visual showcases of VGoT-generated multi-shot videos.
  • Figure 4: Visual comparison of VGoT and baselines.
  • Figure 5: Visual demonstration of the ablation studies of VGoT.
  • ...and 3 more figures