Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training
Zhenghong Zhou, Jie An, Jiebo Luo
TL;DR
This paper tackles the problem of enabling camera trajectory control in pre-trained video diffusion models without additional training. It introduces Latent-Reframe, a sampling-stage approach that reframes per-frame latent codes through time-aware 3D point clouds and repairs occluded regions via latent code inpainting and harmonization, preserving the model’s original distribution. The method delivers competitive or superior camera-control precision and video quality compared to training-based baselines, as evidenced by FID/FVD metrics and pose errors. By leveraging 3D information during inference, Latent-Reframe broadens controllable video generation while maintaining efficiency and fidelity to the pre-trained model.
Abstract
Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, a process that is both data-intensive and computationally costly and that can disrupt the pre-trained model distribution. We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the original model distribution. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model latent space, ensuring high-quality video generation. Experimental results demonstrate that Latent-Reframe achieves camera control precision and video quality comparable or superior to those of training-based methods, without the need for fine-tuning on additional datasets.
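To make the reframing step more concrete, the following is a minimal sketch of how a single frame's latent could be lifted to a point cloud and re-rendered from a new camera pose, with uncovered locations flagged for later inpainting. It is an illustration only, not the authors' implementation: the function name `reframe_latent`, the per-location depth map, the pinhole intrinsics `K`, the relative pose `(R, t)`, and the nearest-pixel splatting with a depth test are all assumptions that the abstract does not specify.

```python
import numpy as np

def reframe_latent(latent, depth, K, R, t):
    """Forward-warp a (C, H, W) latent into a new camera view (hypothetical helper).

    depth: (H, W) depth per latent location (assumed to be available).
    K: 3x3 pinhole intrinsics at latent resolution (assumed).
    R, t: rotation (3x3) and translation (3,) from source to target camera.
    Returns the warped latent and a boolean hole mask (True = needs inpainting).
    """
    C, H, W = latent.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)

    # Unproject every latent location to a 3D point in the source camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)

    # Move the points into the target camera frame and project them.
    pts_t = R @ pts + np.asarray(t, dtype=np.float64).reshape(3, 1)
    z = pts_t[2]
    front = z > 1e-6
    proj = K @ pts_t
    safe_z = np.where(front, z, 1.0)  # avoid dividing by non-positive depth
    u_t = np.rint(proj[0] / safe_z).astype(int)
    v_t = np.rint(proj[1] / safe_z).astype(int)
    valid = front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Depth test: sort nearest first, then keep the first point per target cell.
    idx = np.flatnonzero(valid)
    idx = idx[np.argsort(z[idx], kind="stable")]
    target = v_t[idx] * W + u_t[idx]
    _, first = np.unique(target, return_index=True)
    keep = idx[first]

    warped = np.zeros_like(latent)
    hole = np.ones((H, W), dtype=bool)
    warped[:, v_t[keep], u_t[keep]] = latent.reshape(C, -1)[:, keep]
    hole[v_t[keep], u_t[keep]] = False
    return warped, hole
```

Under these assumptions, calling the function once per frame with that frame's own depth and pose corresponds to using a separate ("time-aware") point cloud per time step. The returned `hole` mask marks where no source point landed, which is where a latent inpainting and harmonization stage such as the one described above would have to fill in content.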
