Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation

Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo

TL;DR

InfCam introduces a depth-free conditioning method for camera-controlled video generation: it embeds camera rotation into the latent space via the infinite homography and learns the residual parallax, achieving high pose fidelity without relying on depth estimates. A dual data-augmentation pipeline (trajectory and intrinsic augmentation) diversifies camera motions and focal lengths to improve generalization from synthetic to real-world data. The approach features a warping module that applies $\mathbf{H}_{\infty}$ and a homography-guided attention mechanism that enforces rotation-aware reasoning; only the newly introduced components are trained, on top of a frozen Wan2.1 backbone. Empirical results on AugMCV and WebVid show state-of-the-art pose accuracy and visual quality, with ablations highlighting the necessity of both the warping module and the augmentation strategy. Overall, InfCam provides robust, depth-free camera control for video generation with strong cross-domain generalization.

Abstract

Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on a dataset of trajectory-video pairs, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model, so that, conditioned on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Project page: https://emjay73.github.io/InfCam/
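
To make the rotation/parallax split in the abstract concrete, here is a minimal NumPy sketch of the textbook infinite homography, $\mathbf{H}_{\infty} = \mathbf{K}' \mathbf{R} \mathbf{K}^{-1}$, applied to a single pixel. It is not the paper's implementation (InfCam warps latents inside a diffusion model); the helper names, intrinsics, rotation, translation, and 3D point are all illustrative assumptions. It shows that the warp reproduces a pure-rotation view exactly without using depth, and that the leftover displacement caused by translation (the parallax) shrinks as the point moves farther away.

```python
import numpy as np

def intrinsics(f, cx, cy):
    """Pinhole intrinsic matrix (hypothetical helper; square pixels, no skew)."""
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

def rotation_y(deg):
    """Rotation about the camera's y axis by `deg` degrees."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def infinite_homography(K_src, K_tgt, R):
    """H_inf = K' R K^{-1}: maps source pixels to target pixels for points at infinity."""
    return K_tgt @ R @ np.linalg.inv(K_src)

def project(K, R, t, X):
    """Project a 3D point X (given in the source camera frame) into a camera K[R | t]."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def warp_pixel(H, uv):
    """Apply a 3x3 homography to one pixel coordinate."""
    x = H @ np.array([uv[0], uv[1], 1.0])
    return x[:2] / x[2]

# Illustrative setup (all values are assumptions): the target camera has a
# shorter focal length, a 10 degree yaw, and a small sideways translation.
K_src = intrinsics(500.0, 320.0, 240.0)
K_tgt = intrinsics(400.0, 320.0, 240.0)
R = rotation_y(10.0)
t = np.array([0.2, 0.0, 0.0])
X = np.array([0.5, -0.3, 4.0])  # 3D point, 4 units in front of the source camera

uv_src = project(K_src, np.eye(3), np.zeros(3), X)
H_inf = infinite_homography(K_src, K_tgt, R)

uv_inf = warp_pixel(H_inf, uv_src)                 # depth-free warp of the source pixel
uv_rot_only = project(K_tgt, R, np.zeros(3), X)    # true projection if t were zero
uv_true = project(K_tgt, R, t, X)                  # true projection with translation

parallax_near = np.linalg.norm(uv_true - uv_inf)
# Same source pixel, but the point is 100x farther away along the same ray.
parallax_far = np.linalg.norm(project(K_tgt, R, t, 100.0 * X) - uv_inf)

print("H_inf warp:", uv_inf, "| rotation-only projection:", uv_rot_only)
print("parallax (near):", parallax_near, "| parallax (far):", parallax_far)
```

The point of the sketch is the design choice the abstract describes: the rotational part of the view change is a closed-form warp that needs no depth, so only the depth-dependent residual has to be learned.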

Paper Structure

This paper contains 41 sections, 4 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: InfCam Results. Given a video and a target camera trajectory, InfCam generates a video that faithfully follows the specified camera path. The world coordinate origin is defined by the first frame's camera pose (highlighted in red). The leftmost column visualizes the backward, arc, and rotational camera trajectories, and the right side shows input–generated video pairs corresponding to each trajectory. The rotational trajectory is generated with a shorter focal length to illustrate wide field-of-view generation. The black dashed box in the last row indicates the original field-of-view of the input video.
  • Figure 2: (a) Infinite homography-based approach (ours). By conditioning on images warped by $\mathbf{H}_{\infty}$, the model focuses on learning the parallax relative to the plane at infinity. This parallax is confined to the region between the epipole $\mathbf{e}'$ and $\mathbf{x}_{\infty}$, as visualized by the yellow segment on the epipolar line $l'$. This spatial constraint helps the model to achieve higher camera pose fidelity with reduced search space. End-to-end training enables the network to implicitly refine the 3D geometry, correcting inaccuracies in $\mathbf{X}$. (b) Reprojection-based approach. Inaccuracies in depth estimation lead to unreliable conditions, causing artifacts in the generated image. Since no gradients flow through the depth estimation network, the incorrect reprojection position $\mathbf{x}'$ remains fixed during training, hindering error correction.
  • Figure 3: Model Architecture Overview. Our model builds upon Wan2.1, training only newly introduced parameters while freezing pretrained weights. (a) DiT block with homography-guided self-attention layer. The homography-guided self-attention layer takes source, target, and warped latents combined with camera embeddings as input, and performs per-frame attention, ensuring temporal alignment. By conditioning on warped latents, the model enables rotation-aware reasoning and constrained parallax estimation. Only source and target latents proceed to the subsequent Wan2.1 layers. (b) Warping module. This module warps the input latent with the infinite homography to handle rotation, then adds camera embeddings for translation. This decomposition simplifies reprojection to parallax estimation relative to the plane at infinity, enabling higher camera trajectory fidelity.
  • Figure 4: Visualization of synchronized multi-view synthetic video datasets. Different trajectories are visualized in different colors. (a) SynCamVideo. Captured with stationary cameras placed at distinct positions. (b) MultiCamVideo. Captured with dynamic cameras following diverse trajectories, all sharing the same initial frame. (c) AugMCV. An augmented version of MultiCamVideo with varied starting poses and different focal lengths.
  • Figure 5: Qualitative Comparison. (a) shows results on the test split of the AugMCV dataset, and (b) presents results on the WebVid dataset. In both cases, GEN3C and ReCamMaster fail to perform proper viewpoint transformations, largely preserving the initial frame of the source video. ReCamMaster further fails to reflect pose changes in the initial frame even when trajectory interpolation is applied (ReCamMaster w/ Interp). TrajectoryCrafter introduces artifacts due to inaccurate reprojection (highlighted in the orange box). In contrast, our method achieves natural pose transitions while maintaining high visual quality throughout the sequence. Best viewed in zoom.
  • ...and 6 more figures
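
The geometric claim in the Figure 2 caption, that the parallax is confined to the epipolar line between $\mathbf{e}'$ and $\mathbf{x}_{\infty}$, follows from the standard two-view relation. The derivation below is a sketch in assumed notation rather than a reproduction of the paper's equations: the source camera is $\mathbf{K}[\mathbf{I} \mid \mathbf{0}]$, the target camera is $\mathbf{K}'[\mathbf{R} \mid \mathbf{t}]$, $\mathbf{x}$ is a homogeneous source pixel with last coordinate 1, $Z$ is its (unknown) depth, and $\sim$ denotes equality up to scale.

$$
\mathbf{X} = Z\,\mathbf{K}^{-1}\mathbf{x}
\quad\Longrightarrow\quad
\mathbf{x}' \sim \mathbf{K}'(\mathbf{R}\mathbf{X} + \mathbf{t})
= Z\,\mathbf{K}'\mathbf{R}\mathbf{K}^{-1}\mathbf{x} + \mathbf{K}'\mathbf{t}
\sim \mathbf{H}_{\infty}\mathbf{x} + \frac{1}{Z}\,\mathbf{e}',
$$

$$
\mathbf{H}_{\infty} = \mathbf{K}'\mathbf{R}\mathbf{K}^{-1},
\qquad
\mathbf{e}' = \mathbf{K}'\mathbf{t},
\qquad
\mathbf{x}_{\infty} = \mathbf{H}_{\infty}\mathbf{x}.
$$

As $Z \to \infty$ the second term vanishes and $\mathbf{x}' \to \mathbf{x}_{\infty}$; for finite positive $Z$, $\mathbf{x}'$ is a combination of $\mathbf{x}_{\infty}$ and $\mathbf{e}'$ and therefore lies on the line $l'$ through them. The only depth-dependent quantity is the scalar $1/Z$, which is consistent with the caption's description of a one-dimensional, bounded search space for the learned parallax.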