CamCloneMaster: Enabling Reference-based Camera Control for Video Generation

Yawen Luo, Jianhong Bai, Xiaoyu Shi, Menghan Xia, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Tianfan Xue

TL;DR

CamCloneMaster introduces a reference-based, parameter-free approach to cloning camera motion from reference videos, unifying image-to-video (I2V) and video-to-video (V2V) generation within a single diffusion-based model. It uses a simple token-concatenation mechanism to inject camera-motion and content conditioning directly into the latent diffusion process, and fine-tunes only the 3D spatio-temporal attention layers to preserve the base model's generative capabilities. A large Unreal Engine 5–based Camera Clone Dataset supports learning diverse camera trajectories and dynamic scenes. According to quantitative metrics and user studies, the method outperforms existing approaches on both I2V and V2V tasks. The work offers a practical, intuitive tool for cinematographers and content creators, and the dataset and method are intended to support future research in camera-controlled video synthesis.

Abstract

Camera control is crucial for generating expressive and cinematic videos. Existing methods rely on explicit sequences of camera parameters as control conditions, which can be cumbersome for users to construct, particularly for intricate camera movements. To provide a more intuitive camera control method, we propose CamCloneMaster, a framework that enables users to replicate camera movements from reference videos without requiring camera parameters or test-time fine-tuning. CamCloneMaster seamlessly supports reference-based camera control for both Image-to-Video and Video-to-Video tasks within a unified framework. Furthermore, we present the Camera Clone Dataset, a large-scale synthetic dataset designed for camera clone learning, encompassing diverse scenes, subjects, and camera movements. Extensive experiments and user studies demonstrate that CamCloneMaster outperforms existing methods in terms of both camera controllability and visual quality.

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Camera Control results of CamCloneMaster. CamCloneMaster is capable of cloning camera motion from reference videos without requiring camera parameters or test-time fine-tuning, which also unifies camera-controlled image-to-video generation and video-to-video re-generation within a single model. For V2V re-generation, the downsized content reference video is positioned beside the prompt. We highly encourage readers to check our demo video for video results, which cannot be well demonstrated by still images.
  • Figure 2: Overview of our proposed CamCloneMaster. Given a camera motion reference video and an optional content reference video as inputs, a 3D VAE encoder converts the reference videos into conditional latents $z_\textrm{cam}$ and $z_\textrm{cont}$. The conditional latents are injected into the model by concatenating them with the noise latent along the frame dimension. Only the 3D spatial-temporal attention layers in the DiT blocks are trained.
  • Figure 3: Dataset Construction Illustration. We collect several 3D scenes as backgrounds and place characters into them as foreground subjects, each paired with a specific animation. Multiple paired camera trajectories are then designed, and shots are rendered in Unreal Engine 5.
  • Figure 4: Quantitative Results for Camera-Controlled Image-to-Video Generation. Camera poses are estimated using MegaSam for parameter-based methods.
  • Figure 5: Quantitative Results for Camera-Controlled V2V Re-Generation. Camera poses are estimated using MegaSam for parameter-based methods.
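
The conditioning mechanism described in the Figure 2 caption (concatenating the conditional latents with the noise latent along the frame dimension) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `concat_condition_latents`, the `(frames, channels, height, width)` layout, the tensor sizes, and the concatenation order are all assumptions made for illustration, and NumPy arrays stand in for the outputs of the 3D VAE encoder.

```python
import numpy as np


def concat_condition_latents(z_noise, z_cam, z_cont=None):
    """Join the noise latent with the conditional latents along the frame axis.

    Assumed (hypothetical) layout: (frames, channels, height, width).
    z_cont is optional: omit it for image-to-video, pass it for video-to-video.
    """
    parts = [z_noise, z_cam]
    if z_cont is not None:
        parts.append(z_cont)

    # Every latent must agree on channels, height, and width;
    # only the number of frames may differ.
    for part in parts:
        if part.shape[1:] != z_noise.shape[1:]:
            raise ValueError("latents must share channel and spatial dimensions")

    # Axis 0 is the frame dimension in the assumed layout.
    # The ordering (noise, camera, content) is an assumption.
    return np.concatenate(parts, axis=0)


rng = np.random.default_rng(0)
z_noise = rng.standard_normal((4, 16, 8, 8))  # noise latent
z_cam = rng.standard_normal((4, 16, 8, 8))    # camera-motion reference latent
z_cont = rng.standard_normal((4, 16, 8, 8))   # content reference latent

i2v_input = concat_condition_latents(z_noise, z_cam)          # camera reference only
v2v_input = concat_condition_latents(z_noise, z_cam, z_cont)  # camera + content reference
```

Because the conditions enter only as extra frames in the token sequence, the same model can accept either input without architectural changes, which is consistent with the caption's statement that a single model handles both the camera-only and camera-plus-content cases.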