I Want It That Way! Specifying Nuanced Camera Motions in Video Editing

Pooja Guhan, Divya Kothandaraman, Geonsun Lee, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha

TL;DR

This work tackles the challenge of obtaining nuanced camera motion in video editing for non-experts. It introduces a zero-shot camera motion transfer method that transfers motion from a single reference video to a static image using dual LoRA finetuning with an orthogonality constraint and homography-guided refinement, all built on a pretrained text-to-video diffusion model. A new CameraScore metric is proposed to quantify motion transfer fidelity, and the approach is validated through extensive quantitative analyses, ablations, and two user studies, showing superior motion accuracy and scene preservation compared to baselines and significantly improved usability. The results demonstrate a practical, reference-based workflow that democratizes cinematic camera control and informs future development of modular, user-centered generative video tools.

Abstract

Specifying nuanced and compelling camera motion remains a major hurdle for non-expert creators using generative tools, creating an "expressive gap" where generic text prompts fail to capture cinematic vision. To address this, we present a novel zero-shot diffusion-based system that enables personalized camera motion transfer from a single reference video onto a user-provided static image. Our technical contribution introduces an intuitive interaction paradigm that bypasses the need for 3D data, predefined trajectories, or complex graphical interfaces. The core pipeline leverages a text-to-video diffusion model, employing a two-phase strategy: 1) a multi-concept learning method using LoRA layers and an orthogonality loss to distinctly capture spatial-temporal characteristics and scene features, and 2) a homography-based refinement strategy to enhance temporal and spatial alignment of the generated video. Extensive evaluation demonstrates the efficacy of our method. In a comparative study with 72 participants, our system was significantly preferred over prior work for both motion accuracy (90.45%) and scene preservation (70.31%). A second study confirmed our interface significantly improves usability and creative control for video direction. Our work contributes a robust technical solution and a novel human-centered design, significantly expanding cinematic video editing for diverse users.
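The abstract names an orthogonality loss between LoRA-learned concepts but does not give its form. As a rough illustration only, the sketch below assumes one common formulation: penalize the squared Frobenius norm of the cross-product between two low-rank weight updates, so that the penalty vanishes when the two adapters write into orthogonal output directions. The function names (`lora_delta`, `orthogonality_penalty`) and the toy matrices are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def lora_delta(B: Matrix, A: Matrix) -> Matrix:
    """Low-rank weight update delta_W = B @ A, with B (d_out x r), A (r x d_in)."""
    return matmul(B, A)


def orthogonality_penalty(delta_1: Matrix, delta_2: Matrix) -> float:
    """Squared Frobenius norm of delta_1^T @ delta_2.

    ASSUMED loss form, not the paper's definition. It is zero exactly when
    every column of delta_1 is orthogonal to every column of delta_2.
    """
    cross = matmul(transpose(delta_1), delta_2)
    return sum(v * v for row in cross for v in row)


# Toy rank-1 adapters on a 3-output, 2-input layer (hypothetical values).
delta_temporal = lora_delta([[1.0], [0.0], [0.0]], [[1.0, 2.0]])
delta_spatial = lora_delta([[0.0], [1.0], [0.0]], [[3.0, 4.0]])

print(orthogonality_penalty(delta_temporal, delta_spatial))   # 0.0
print(orthogonality_penalty(delta_temporal, delta_temporal))  # 25.0
```

In a training loop, a term like this would typically be added to the diffusion loss with a small weight; how the paper weights or applies its own loss is not stated in the abstract.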

Paper Structure

This paper contains 31 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Example scenarios in which COLMAP does and does not work reliably. In video (A), the observed camera effect comes from explicit movement of the camera, while in (B) it comes from a change in camera focal length. COLMAP fails to converge for videos like (B).
  • Figure 2: We present a zero-shot method to transfer the camera motion visible in a reference video $V_R$ onto a user-provided image $I_u$ to generate a video $V_u$. The algorithm has two phases. The first phase learns multiple concepts: the spatial and temporal features of the reference video and the spatial characteristics of the user-provided image. We propose a spatial-temporal orthogonality loss to learn these concepts more distinctly. The second phase uses homography-based guidance to refine the generated video so that it preserves both the scene in $I_u$ and the camera motion obtained from the reference video $V_R$.
  • Figure 3: Graph (a) plots VideoCLIP scores against CameraScore for the diffusion methods being compared. Graph (b) plots DINO scores against CameraScore for the same methods. Our approach achieves the best trade-off.
  • Figure 4: Graph (A) illustrates the relationship between CameraScore and DINO, while Graph (B) shows the relationship between CameraScore and VideoCLIP. These plots are derived from the ablation experiments conducted to evaluate the significance of various key components in our proposed approach.
  • Figure 5: Qualitative results of our method, demonstrating clear improvements over prior work in transferring camera motion, while preserving scene content. More results in the appendix.
  • ...and 3 more figures
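
Figures 1 and 2 both refer to camera effects that a homography can describe. As a small illustration (not the paper's refinement step), the sketch below builds the homography induced by a pure focal-length change with a fixed camera, as in Figure 1 (B), and applies it to pixel coordinates. It assumes square pixels, no lens distortion, and a principal point at `(cx, cy)`; the function names and numeric values are hypothetical.

```python
from typing import List, Tuple

Matrix3 = List[List[float]]
Point = Tuple[float, float]


def zoom_homography(scale: float, cx: float, cy: float) -> Matrix3:
    """Homography for a focal-length change by `scale` about (cx, cy).

    Equals K_new @ inverse(K_old) for pinhole intrinsics that differ only
    in focal length (ASSUMES square pixels and a fixed principal point).
    """
    return [
        [scale, 0.0, cx * (1.0 - scale)],
        [0.0, scale, cy * (1.0 - scale)],
        [0.0, 0.0, 1.0],
    ]


def apply_homography(H: Matrix3, point: Point) -> Point:
    """Map a pixel through H using homogeneous coordinates."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


# A 2x zoom on a 100x100 frame, centered on the principal point.
H = zoom_homography(2.0, 50.0, 50.0)
print(apply_homography(H, (50.0, 50.0)))  # (50.0, 50.0): the center stays put
print(apply_homography(H, (60.0, 50.0)))  # (70.0, 50.0): offsets double
```

Because a zoom of this kind involves no camera translation, it produces no parallax, which is consistent with the caption's observation that structure-from-motion tools such as COLMAP struggle on videos like (B).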