Generative Photographic Control for Scene-Consistent Video Cinematic Editing

Huiqiang Sun, Liao Shen, Zhan Peng, Kun Wang, Size Wu, Yuhang Zang, Tianqi Liu, Zihao Huang, Xingyu Zeng, Zhiguo Cao, Wei Li, Chen Change Loy

TL;DR

This work tackles the challenge of editing videos with fine-grained photographic effects while preserving cinematic coherence. It introduces CineCtrl, a video-to-video (V2V) editing framework built on a pre-trained diffusion backbone and augmented with a Camera-Decoupled Cross-Attention mechanism that encodes the camera trajectory separately from the photographic controls (blur intensity $K$, refocused disparity $d_f$, focal length $f$, shutter speed $S$, and color temperature $T$). A two-part data strategy combines physics-based photographic-effect simulation on synthetic data with a curated real-world dataset, yielding 170k synthetic and 32k real training clips. Empirical results show that CineCtrl achieves precise control over photographic parameters, maintains high video quality and scene consistency, and is preferred over baselines in user studies. This work enables practical, studio-grade cinematic edits within generative video pipelines and opens avenues for automated aesthetic planning in video generation.

Abstract

Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that combines simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.

Paper Structure

This paper contains 20 sections, 20 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Examples of fine-grained photographic control with our CineCtrl. The source video is edited into generated outputs with independently adjusted photographic parameters: bokeh (blur intensity $K$ and refocused disparity $d_f$), exposure (shutter speed $S$), color tone (color temperature $T$), and zoom (focal length $f$), as well as novel camera trajectories. CineCtrl enables precise and disentangled manipulation of these cinematic effects while preserving scene consistency.
  • Figure 2: Overall framework of CineCtrl, which is built upon the Wan2.1 T2V framework and extended to a V2V model. To enable camera control, we inject both camera trajectory and photographic parameter signals into the DiT block. Through our proposed Camera-Decoupled Cross-Attention mechanism, we disentangle these two signals to achieve accurate and independent control.
  • Figure 3: Illustration of the dataset construction. We generate training pairs by applying our proposed photographic effect simulator to both a synthetic dataset and a high-quality real-world dataset, which we curated from web and movie sources through a shot detection and filtering pipeline.
  • Figure 4: Comparison with baselines. CineCtrl achieves fine-grained camera parameter control while producing output videos of high visual quality.
  • Figure 5: Qualitative ablation study. Without the decoupled cross-attention (Decoupled CA), output videos exhibit noticeable visual artifacts. In addition, control over the bokeh focal plane becomes unreliable when the model is trained without the real-world dataset or with a naïve data parameter setting.
  • ...and 4 more figures
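
The Figure 2 caption describes injecting camera trajectory and photographic parameter signals into the DiT block and disentangling them through Camera-Decoupled Cross-Attention. The summary does not give the layer's exact form, so the following is only a minimal NumPy sketch of one plausible reading: two independent cross-attention branches, one per signal, whose outputs are added to the video tokens as residuals. All names, shapes, the residual-sum fusion, and the choice of one token per photographic parameter are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend to context tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def decoupled_cross_attention(video_tokens, traj_tokens, photo_tokens, params):
    """Hypothetical decoupled layer: each control signal has its own
    projection weights and its own attention branch, so the trajectory
    branch never sees the photographic tokens and vice versa."""
    traj_out = cross_attention(video_tokens, traj_tokens, *params["traj"])
    photo_out = cross_attention(video_tokens, photo_tokens, *params["photo"])
    # Assumed fusion: sum both branches onto the residual stream.
    return video_tokens + traj_out + photo_out


rng = np.random.default_rng(0)
dim = 8


def init_branch():
    """Random (w_q, w_k, w_v) projection matrices for one branch."""
    return tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))


params = {"traj": init_branch(), "photo": init_branch()}
video_tokens = rng.normal(size=(16, dim))  # flattened video latent tokens
traj_tokens = rng.normal(size=(4, dim))    # embedded camera trajectory
photo_tokens = rng.normal(size=(5, dim))   # assumed: one token each for K, d_f, f, S, T

out = decoupled_cross_attention(video_tokens, traj_tokens, photo_tokens, params)
print(out.shape)
```

In this sketch, changing `photo_tokens` alters only the photographic branch's contribution while the trajectory branch's output stays identical, which is the kind of independence the caption attributes to the decoupled design.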