ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Songlin Yang, Zhe Wang, Xuyi Yang, Songchun Zhang, Xianghao Kong, Taiyi Wu, Xiaotong Zhao, Ran Zhang, Alan Zhao, Anyi Rao

Abstract

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.

Paper Structure (15 sections, 5 equations, 4 figures, 7 tables, 2 algorithms)

Figures (4)

  • Figure 1: Cinematic, Camera-Controlled, Multi-Shot Video Creation via our ShotVerse Framework. (i) Multi-Shot Data Foundation: We curate the ShotVerse-Bench dataset from high-production cinema and propose a novel calibration pipeline that aligns disjoint shot trajectories into a unified global coordinate system. (ii) "Plan-then-Control" Framework: A VLM-based Planner automates the plotting of explicit, unified, cinematic trajectories from prompts, which serve as precise guidance for the Controller to synthesize content. (iii) Superior Performance: Examples demonstrate high-fidelity generation with strong camera control across diverse genres. The inset 3D plots visualize the plotted explicit trajectories.
  • Figure 2: Method Overview. (i) Dataset Curation: We construct ShotVerse-Bench by aligning multi-shot trajectories into a unified global coordinate system via camera calibration, paired with hierarchical global and per-shot captions. (ii) Trajectory Plotting: The Planner utilizes a VLM to process the hierarchical prompt interleaved with learnable trajectory query tokens. These inputs are encoded into context-aware embeddings and transformed into explicit camera poses via a Trajectory Decoder and a Pose De-Tokenizer. (iii) Trajectory Injection: The Controller synthesizes high-fidelity videos using a holistic DiT backbone. It precisely follows the trajectories via a Camera Adapter and a 4D Rotary Positional Embedding strategy.
  • Figure 3: Comparisons with the State-of-the-Art Baseline Methods. Early camera-controlled text-driven generation models (e.g., CameraCtrl, MotionCtrl) struggle to handle complex cinematic camera trajectories. ReCamMaster executes the trajectory but drifts away from the subject in Shot 1. HoloCine, MultiShotMaster, Sora2, VEO3, Kling3.0, and Seedance2.0 fail to execute the complex "orbit" command, remaining nearly static. These failures demonstrate that for text-driven models, scaling up caption density is insufficient to achieve precise control without explicit geometric guidance.
  • Figure 4: Qualitative Ablation Study. (a) The camera encoder is vital for viewpoint grounding; without it, the model fails to maintain subject orientation (e.g., frontal faces). (b) High-noise pose injection already establishes the global motion scaffold, while adding low-noise injection yields marginal gains. (c) 4D RoPE ensures better shot-cutting stability than 3D RoPE. (d) Without calibration, the camera trajectory is not globally aligned across shots, causing inaccurate subject tracking. (e) Training on synthetic triplets makes both the character and the environment look synthetic, and the domain gap to real videos degrades visual quality and temporal stability.
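
The captions above refer to aligning disjoint per-shot trajectories into a unified global coordinate system. The calibration pipeline itself is not specified in this excerpt, so the following is only a minimal sketch of the final re-expression step, under the assumption that a similarity transform (scale, rotation, translation) from each shot's local frame to the global frame has already been estimated by some means. The names `to_global` and `align_shots`, and the 4x4 camera-to-world pose convention, are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
# Assumed per-shot transform: (scale, 3x3 rotation, 3-vector translation).
Similarity = Tuple[float, Matrix, Sequence[float]]


def to_global(pose: Matrix, scale: float, rot: Matrix, trans: Sequence[float]) -> Matrix:
    """Re-express one 4x4 camera-to-world pose from a shot-local frame in the global frame.

    The camera orientation is rotated by `rot`; the camera center is scaled,
    rotated, and translated. Scale is applied to the center only, so the
    rotation block of the result stays a valid rotation.
    """
    r_cam = [row[:3] for row in pose[:3]]
    center = [pose[i][3] for i in range(3)]
    r_new = [
        [sum(rot[i][k] * r_cam[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
    c_new = [
        scale * sum(rot[i][k] * center[k] for k in range(3)) + trans[i]
        for i in range(3)
    ]
    return [r_new[i] + [c_new[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def align_shots(shots: List[List[Matrix]], transforms: List[Similarity]) -> List[List[Matrix]]:
    """Map every pose of every shot into the shared global frame."""
    aligned = []
    for shot_poses, (scale, rot, trans) in zip(shots, transforms):
        aligned.append([to_global(p, scale, rot, trans) for p in shot_poses])
    return aligned


# Toy example: shot 0 already lives in the global frame; shot 1 is rotated
# 90 degrees about the z-axis, scaled by 2, and shifted along x.
IDENTITY3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
LOCAL_POSE = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

aligned = align_shots(
    [[LOCAL_POSE], [LOCAL_POSE]],
    [(1.0, IDENTITY3, [0.0, 0.0, 0.0]), (2.0, ROT_Z90, [1.0, 0.0, 0.0])],
)
print([aligned[1][0][i][3] for i in range(3)])  # camera center of shot 1 in the global frame
```

In this toy setup the same local pose ends up at different global positions in the two shots, which is the property the ablation in Figure 4(d) suggests matters for tracking a subject across cuts.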