Generative Photographic Control for Scene-Consistent Video Cinematic Editing
Huiqiang Sun, Liao Shen, Zhan Peng, Kun Wang, Size Wu, Yuhang Zang, Tianqi Liu, Zihao Huang, Xingyu Zeng, Zhiguo Cao, Wei Li, Chen Change Loy
TL;DR
This work tackles the challenge of editing videos with fine-grained photographic effects while preserving scene consistency. It introduces CineCtrl, a video-to-video (V2V) editing framework built on a pre-trained diffusion backbone and augmented with a Camera-Decoupled Cross-Attention mechanism that encodes camera trajectory and photographic controls (e.g., $K$, $d_f$, $f$, $S$, $T$) separately. A two-fold data strategy combines physics-based photographic-effect simulation on synthetic data with a curated real-world dataset, totaling 170k synthetic and 32k real clips, for robust training. Empirical results show that CineCtrl achieves precise control over photographic parameters, maintains high video quality and scene consistency, and is preferred over baselines in user studies. This work makes practical cinematic edits possible within generative video pipelines and opens avenues for automated aesthetic planning in video generation.
Abstract
Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling them in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that combines simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.
