MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia
TL;DR
MultiShotMaster addresses the gap in controllable multi-shot video generation by extending a pretrained single-shot diffusion transformer with two RoPE variants: Multi-Shot Narrative RoPE for shot-boundary awareness and Spatiotemporal Position-Aware RoPE for grounded reference injection. It also introduces a Multi-Shot & Multi-Reference Attention Mask and an automated data curation pipeline, which together enable text-driven inter-shot consistency, subject motion control, and background-guided scene customization across variable shot counts and durations. The model is trained in three stages, and evaluation on narrative multi-shot prompts and grounding tasks shows better inter-shot coherence, transition accuracy, and grounding fidelity than the baselines. The work advances practical, director-level controllable multi-shot video generation and lays groundwork for future scaling and for decoupling camera motion from subject dynamics.
Abstract
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, a coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages the intrinsic properties of the architecture to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subjects with motion control, and background-driven customized scenes. Both the shot count and the duration of each shot are flexibly configurable. Extensive experiments demonstrate the superior performance and controllability of our framework.
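As an illustration of the phase shift at shot transitions, the sketch below shows one plausible reading: temporal position indices increase by one within a shot and jump by an extra fixed offset at each shot boundary, so the rotary phase changes abruptly at a cut while the narrative order is preserved. The function names, the `shift` value, and the frequency base of 10000 are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def multishot_temporal_positions(shot_lengths, shift):
    """Temporal index per frame, with an extra `shift` added at each shot boundary.

    Hypothetical helper: `shift` is an assumed hyperparameter, not a value from the paper.
    """
    positions = []
    for shot_idx, length in enumerate(shot_lengths):
        # Frames before this shot, plus one extra offset per preceding boundary.
        start = sum(shot_lengths[:shot_idx]) + shot_idx * shift
        positions.extend(start + t for t in range(length))
    return np.asarray(positions, dtype=np.float64)

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles: position * base^(-2i/dim) for each frequency pair i."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

# Three shots of 3, 2, and 4 frames, with an assumed boundary offset of 5.
positions = multishot_temporal_positions([3, 2, 4], shift=5)
angles = rope_angles(positions, dim=8)
```

With these inputs, consecutive frames inside a shot differ by 1 in position, while frames on either side of a cut differ by `1 + shift`. Because the shot lengths are passed as a list, the same helper covers any shot count and per-shot duration.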
