EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal

TL;DR

The paper tackles the challenge of precise 3D-informed camera control in video diffusion models, where traditional anchor videos built from point-cloud reconstructions and camera trajectories suffer from misalignment and annotation bottlenecks. EPiC introduces a visibility-based masking pipeline to construct precisely aligned anchor videos from in-the-wild footage and pairs it with a lightweight Anchor-ControlNet that copies visible content while leaving occluded regions to the backbone to synthesize. This design eliminates the need for ground-truth trajectories, enables training on diverse data, and achieves state-of-the-art performance on RealEstate10K and MiraData for image-to-video camera control, with strong zero-shot generalization to video-to-video tasks. Ablation studies show the advantages of masking-based anchors, artifact-aware training, and visibility-gated conditioning, along with clear efficiency gains in data, compute, and parameter count.
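The idea of visibility-gated conditioning can be illustrated with a small sketch. It is not the authors' code: it assumes the control branch contributes an additive residual to the backbone features and that a per-position boolean visibility flag is available. The function name `gated_residual` and the flat-list feature layout are hypothetical, chosen to keep the example dependency-free.

```python
from typing import List


def gated_residual(
    backbone: List[float],
    control: List[float],
    visible: List[bool],
) -> List[float]:
    """Add the control signal only where the anchor video has visible content.

    Assumption for illustration: conditioning is an additive residual, and
    occluded positions receive no control signal, so the backbone alone
    decides what to synthesize there.
    """
    if not (len(backbone) == len(control) == len(visible)):
        raise ValueError("backbone, control and visible must have equal length")
    return [b + (c if v else 0.0) for b, c, v in zip(backbone, control, visible)]


# Toy example: the middle position is occluded, so it keeps the backbone value.
features = gated_residual([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [True, False, True])
```

Under this assumption the gate keeps the control branch from injecting guidance into regions where the anchor video carries no information.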

Abstract

Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions into pretrained VDMs, with less than 1% of the backbone model's parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, fewer training steps, and less data, without requiring the modifications to the diffusion model backbone that are typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
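A minimal sketch of masking a source video by first-frame visibility is given here in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: it assumes a point tracker (not shown) has already followed each first-frame pixel through the clip and reported, per frame, where it lands and whether it is still visible. The names `visibility_mask` and `build_anchor_video`, the `(row, col, visible)` track layout, the grayscale toy frames, and the zero fill value are all hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]        # H x W toy grayscale frame (stand-in for RGB)
Track = Tuple[int, int, bool]  # (row, col, visible) of one first-frame pixel


def visibility_mask(tracks: List[Track], height: int, width: int) -> List[List[bool]]:
    """Mark the pixels where a still-visible first-frame point lands."""
    mask = [[False] * width for _ in range(height)]
    for row, col, visible in tracks:
        # Points that are occluded or have left the frame contribute nothing.
        if visible and 0 <= row < height and 0 <= col < width:
            mask[row][col] = True
    return mask


def build_anchor_video(
    source: List[Frame],
    tracks_per_frame: List[List[Track]],
    fill: int = 0,
) -> List[Frame]:
    """Keep source pixels that were visible in the first frame; blank the rest."""
    anchor = []
    for frame, tracks in zip(source, tracks_per_frame):
        height, width = len(frame), len(frame[0])
        mask = visibility_mask(tracks, height, width)
        anchor.append(
            [[frame[r][c] if mask[r][c] else fill for c in range(width)]
             for r in range(height)]
        )
    return anchor


# Toy clip: two 2x2 frames. In frame 1 only one first-frame pixel is both
# visible and inside the image, so only that pixel survives in the anchor.
source = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
tracks_per_frame = [
    [(0, 0, True), (0, 1, True), (1, 0, True), (1, 1, True)],
    [(0, 1, True), (0, 2, True), (1, 1, False), (1, 2, True)],
]
anchor = build_anchor_video(source, tracks_per_frame)
```

Because every retained pixel is copied from the source video itself, the anchor is aligned with the target by construction, which is the property the abstract contrasts with point-cloud rendering.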

Paper Structure

This paper contains 31 sections, 2 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Comparison of anchor video creation methods for training camera control models. (a) Previous methods (ren2025gen3c; yu2024viewcrafter) estimate the 3D point cloud (through depth estimation) from the first frame and render anchor videos with annotated camera trajectories. They suffer from region misalignment due to point-cloud estimation errors and are limited to data with camera-pose annotations, resulting in inefficient training. (b) Our method creates anchor videos via visibility masking based on first-frame pixel tracking. This not only guarantees accurate geometric alignment but also supports diverse data while substantially reducing training costs. We highlight the video regions in red and green boxes to compare the alignment quality.
  • Figure 2: EPiC Model Architecture. (a) shows an overview of our EPiC framework. EPiC supports multiple inference scenarios. (b) and (c) illustrate our I2V inference scenarios using full and masked point clouds, respectively. (d) depicts the V2V inference scenario, which employs dynamic point clouds.
  • Figure 3: Anchor video construction.
  • Figure 4: Comparison of generated videos with other camera control methods for I2V and V2V tasks.
  • Figure 5: Qualitative examples for ablation study.
  • ...and 9 more figures