I Want It That Way! Specifying Nuanced Camera Motions in Video Editing
Pooja Guhan, Divya Kothandaraman, Geonsun Lee, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha
TL;DR
This work tackles the challenge of obtaining nuanced camera motion in video editing for non-experts. It introduces a zero-shot camera motion transfer method that transfers motion from a single reference video to a static image using dual LoRA finetuning with an orthogonality constraint and homography-guided refinement, all built on a pretrained text-to-video diffusion model. A new CameraScore metric is proposed to quantify motion transfer fidelity, and the approach is validated through extensive quantitative analyses, ablations, and two user studies, showing superior motion accuracy and scene preservation compared to baselines and significantly improved usability. The results demonstrate a practical, reference-based workflow that democratizes cinematic camera control and informs future development of modular, user-centered generative video tools.
Abstract
Specifying nuanced and compelling camera motion remains a major hurdle for non-expert creators using generative tools, creating an "expressive gap" where generic text prompts fail to capture cinematic vision. To address this, we present a novel zero-shot diffusion-based system that transfers personalized camera motion from a single reference video onto a user-provided static image. The system offers an intuitive interaction paradigm that bypasses the need for 3D data, predefined trajectories, or complex graphical interfaces. The core pipeline builds on a text-to-video diffusion model and uses a two-phase strategy: 1) a multi-concept learning method that uses LoRA layers and an orthogonality loss to separately capture spatial-temporal characteristics and scene features, and 2) a homography-based refinement strategy that improves the temporal and spatial alignment of the generated video. Extensive evaluation demonstrates the efficacy of our method. In a comparative study with 72 participants, our system was significantly preferred over prior work for both motion accuracy (90.45%) and scene preservation (70.31%). A second study confirmed that our interface significantly improves usability and creative control for video direction. Our work contributes a robust technical solution and a novel human-centered design, expanding cinematic video editing to diverse users.
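To make the role of the orthogonality loss in phase 1 more concrete, the following is a minimal sketch of one common way to penalize overlap between two weight updates. The abstract does not give the exact loss, so the formulation (squared Frobenius norm of the cross product) and the names `orthogonality_penalty`, `delta_motion`, and `delta_scene` are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def orthogonality_penalty(delta_motion: Matrix, delta_scene: Matrix) -> float:
    """Squared Frobenius norm of delta_motion^T @ delta_scene.

    Assumed formulation: the value is zero exactly when every column of the
    (hypothetical) motion update is orthogonal to every column of the
    (hypothetical) scene update, and grows as the two updates overlap.
    """
    cross = matmul(transpose(delta_motion), delta_scene)
    return float(sum(v * v for row in cross for v in row))


# Toy 2x2 updates: one acts only on the first axis, the other only on the second.
motion = [[1.0, 0.0], [0.0, 0.0]]
scene = [[0.0, 0.0], [0.0, 1.0]]

disjoint = orthogonality_penalty(motion, scene)    # updates do not overlap
identical = orthogonality_penalty(motion, motion)  # updates fully overlap
```

In a training loop, a term like this would typically be added to the diffusion loss with a small weight, so that the two LoRA branches are discouraged from encoding the same directions; the weighting and the layers it applies to are likewise not specified in the abstract.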
