MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia

TL;DR

MultiShotMaster addresses the gap in controllable multi-shot video generation by extending a pretrained single-shot diffusion transformer with two RoPE variants: Multi-Shot Narrative RoPE for shot-boundary awareness and Spatiotemporal Position-Aware RoPE for grounded reference injection. It introduces a Multi-Shot & Multi-Reference Attention Mask and an automated data curation pipeline to enable text-driven inter-shot consistency, subject motion control, and background-guided scene customization across variable shot counts and durations. A three-stage training regime, plus evaluation on narrative multi-shot prompts and grounding tasks, demonstrates superior inter-shot coherence, transition accuracy, and grounding fidelity over baselines. The work advances practical, director-level controllable multi-shot video generation and lays groundwork for future scaling and decoupling of camera motion from subject dynamics.

Abstract

Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, a coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages the intrinsic architectural properties of the model to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subjects with motion control, and background-driven customized scenes. Both shot count and shot duration are flexibly configurable. Extensive experiments demonstrate the superior performance and controllability of our framework.
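To make the idea of a phase shift at shot transitions concrete, the following is a minimal sketch, not the authors' implementation. It assumes the shift is a constant offset added to the temporal position index each time a new shot begins, and it uses the standard 1D rotary embedding with base 10000. The names `narrative_positions`, `rope_rotate`, and `phase_shift`, as well as the shift value, are hypothetical and do not come from the paper.

```python
import math

def narrative_positions(shot_lengths, phase_shift):
    """Temporal position index for every frame.

    Frames keep their global order; each shot boundary adds an extra
    offset of `phase_shift` (an assumed constant, not the paper's value).
    """
    positions = []
    for shot_idx, n_frames in enumerate(shot_lengths):
        start = len(positions)  # number of frames emitted so far
        for k in range(n_frames):
            positions.append(start + k + shot_idx * phase_shift)
    return positions

def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        theta = position * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# Two shots of 3 and 2 frames with a hypothetical shift of 4 positions.
pos = narrative_positions([3, 2], phase_shift=4)
rotated = rope_rotate([1.0, 0.0, 0.0, 1.0], pos[3])
```

Under these assumptions, positions inside a shot advance by one per frame, while the first frame of each new shot jumps ahead by the extra offset. The ordering across shots is preserved, so the narrative order stays recoverable from the positions alone.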

Paper Structure

This paper contains 25 sections, 3 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: We propose MultiShotMaster, the first controllable multi-shot video generation framework that supports text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot counts and shot durations are variable. Only the global caption of the first case is shown for brevity.
  • Figure 2: Overview of MultiShotMaster. We extend a pretrained single-shot T2V model with two key RoPE variants: Multi-Shot Narrative RoPE for flexible shot arrangement with temporal narrative order, and Spatiotemporal Position-Aware RoPE for grounded reference injection. To manage in-context information flows, we design a Multi-Shot & Multi-Reference Attention Mask. We finetune the temporal attention, cross-attention, and FFN layers, leveraging the intrinsic architectural properties to achieve flexible and controllable multi-shot video generation.
  • Figure 3: Data Curation Pipeline: (1) We employ a shot transition detection model [soucek2024transnet] to cut the collected long videos into short clips, use a scene segmentation model [wu2022scene] to cluster clips within the same scene, and then sample multi-shot videos. (2) We introduce a hierarchical caption structure and use Gemini-2.5 [comanici2025gemini] in a two-stage process to produce a global caption and per-shot captions. (3) We integrate YOLOv11 [khanam2024yolov11], ByteTrack [zhang2022bytetrack], and SAM [kirillov2023segment] to detect, track, and segment subject images. We then use Gemini-2.5 to merge the per-shot tracking results by subject appearance. We obtain clean backgrounds using OmniEraser [wei2025omnieraser].
  • Figure 4: Qualitative Comparisons. We compare with two multi-shot video generation methods [wu2025cinetrans, wang2025echoshot] in the upper part, and with two single-shot reference-to-video methods [vace, liu2025phantom] under a multi-shot setting in the lower part. [ ] denotes the placeholder for character descriptions used by the baselines. The character introductions in the lower part are omitted for brevity.
  • Figure 5: Limitation visualization. We only explicitly control the subject motion, while the camera position is controlled by text prompts, which might cause the motion coupling issue.
  • ...and 4 more figures
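
The Figure 2 caption says only that a Multi-Shot & Multi-Reference Attention Mask manages in-context information flows; the exact rule is not given here. The sketch below shows one plausible reading, offered purely as an assumption: video tokens attend to all video tokens across shots, while each reference token exchanges attention only with the shot it is assigned to. The function `build_mask` and its parameters `shot_lengths`, `ref_shots`, and `ref_len` are hypothetical names for illustration.

```python
def build_mask(shot_lengths, ref_shots, ref_len=1):
    """Boolean attention mask, mask[q][k] == True means q may attend to k.

    shot_lengths: number of video tokens in each shot.
    ref_shots:    for each reference, the index of the shot it belongs to.
    ref_len:      tokens per reference (assumed equal for all references).

    Assumed rule (not confirmed by the source): video-to-video attention
    is unrestricted; any pair involving a reference token is allowed only
    when both tokens are tied to the same shot.
    """
    owner = []  # (kind, shot index) for every token, video tokens first
    for shot, n_tokens in enumerate(shot_lengths):
        owner += [("video", shot)] * n_tokens
    for shot in ref_shots:
        owner += [("ref", shot)] * ref_len

    n = len(owner)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_kind, q_shot = owner[q]
            k_kind, k_shot = owner[k]
            if q_kind == "video" and k_kind == "video":
                mask[q][k] = True
            else:
                mask[q][k] = q_shot == k_shot
    return mask

# Two shots of 2 video tokens each, plus one reference tied to shot 1.
mask = build_mask([2, 2], ref_shots=[1])
```

In this layout the token order is shot 0, shot 1, then the reference, so the reference column is visible only to the rows of shot 1. A mask of this shape can be passed to any attention routine that accepts a boolean allow-list.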