VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Mingzhe Zheng; Yongqi Xu; Haojian Huang; Xuran Ma; Yexin Liu; Wenjie Shu; Yatian Pang; Feilong Tang; Qifeng Chen; Harry Yang; Ser-Nam Lim

TL;DR

VideoGen-of-Thought (VGoT) introduces a training-free, modular pipeline for generating cohesive multi-shot videos from a single sentence. It combines dynamic storyline modeling with self-validation, identity-aware cross-shot propagation using IPP tokens, and adjacent latent transition mechanisms to ensure narrative coherence and visual consistency across shots. Quantitative and human evaluations show substantial gains in within-shot and cross-shot face and style consistency compared to baselines, with significantly reduced manual intervention. The work also provides a new multi-shot evaluation protocol and a 10-story benchmark to assess long-form narrative video generation.

Abstract

Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which turns the user prompt into concise shot drafts and then expands them into detailed specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, and HDR lighting) with self-validation to ensure logical progress. (2) Visual inconsistency: Previous approaches struggle to maintain consistent appearance across shots. Our identity-aware cross-shot propagation builds identity-preserving portrait (IPP) tokens that keep character identity while allowing controlled trait changes (expressions, aging) required by the story. (3) Transition artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while requiring 10x fewer manual adjustments. VGoT bridges the gap between raw visual synthesis and director-level storytelling for automated multi-shot video generation.
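To make the first stage more concrete, the following is a minimal Python sketch of how shot drafts might be expanded into five-domain specifications with a validation-and-retry loop. It is an illustration under stated assumptions, not the authors' implementation: the names ShotSpec, is_valid, expand_with_validation, and toy_expand are hypothetical, the expander stands in for whatever language model the pipeline actually calls, and the validity check here only tests that every domain is filled in, whereas the paper's self-validation targets logical progress across shots.

```python
from dataclasses import dataclass, fields
from typing import Callable, List


@dataclass
class ShotSpec:
    """Hypothetical container for the five domains named in the abstract."""
    character_dynamics: str
    background_continuity: str
    relationship_evolution: str
    camera_movements: str
    hdr_lighting: str


def is_valid(spec: ShotSpec) -> bool:
    # Stand-in check (assumption): a spec passes if every domain is non-empty.
    return all(getattr(spec, f.name).strip() for f in fields(spec))


def expand_with_validation(
    drafts: List[str],
    expand: Callable[[str, int], ShotSpec],
    max_retries: int = 2,
) -> List[ShotSpec]:
    """Expand each draft, re-querying the expander when validation fails."""
    specs: List[ShotSpec] = []
    for draft in drafts:
        for attempt in range(max_retries + 1):
            spec = expand(draft, attempt)
            if is_valid(spec):
                specs.append(spec)
                break
        else:
            # No attempt produced a valid spec for this draft.
            raise ValueError(f"no valid specification for draft: {draft!r}")
    return specs


def toy_expand(draft: str, attempt: int) -> ShotSpec:
    # Toy expander (assumption): the first attempt omits the camera field,
    # so the retry path is exercised; a real system would query a model.
    camera = "" if attempt == 0 else "slow push-in"
    return ShotSpec(
        character_dynamics=f"protagonist in: {draft}",
        background_continuity="same street as previous shot",
        relationship_evolution="unchanged",
        camera_movements=camera,
        hdr_lighting="warm evening light",
    )


specs = expand_with_validation(["a child finds a violin", "first lesson"], toy_expand)
```

In this sketch each draft is retried independently, which keeps the loop simple; a validator that checks consistency between consecutive shots would need access to the previously accepted specifications as well.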
