PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang, Jialing Liu, Nan Wang, Haomin Liu, Guofeng Zhang

TL;DR

PostCam addresses post-capture camera-trajectory editing for dynamic scenes. Its query-shared cross-attention module fuses 6-DoF camera poses and rendered video frames into a shared feature space. Training proceeds in two stages: the model first learns coarse camera control from pose inputs, then incorporates rendered visual information to refine motion accuracy and visual fidelity. On real-world and synthetic datasets, PostCam outperforms state-of-the-art methods by over 20% in camera-control precision and view consistency while achieving the highest video generation quality among the compared methods. Code and data are planned for release to support future research.

Abstract

We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods rely on suboptimal camera motion injection strategies, which limit camera control precision and cause the generated videos to lose fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module that integrates two distinct control signals: 6-DoF camera poses and 2D rendered video frames. By fusing them into a unified representation within a shared feature space, the model can extract underlying motion cues, which improves both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/
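
The abstract does not spell out the internals of the query-shared cross-attention module, so the following is only a minimal single-head NumPy sketch of one plausible reading: a single query projection of the video tokens is reused to attend over two separate key/value sets, one from camera-pose tokens and one from rendered-video tokens, and the two results are summed. All names (`init_params`, `W_k_pose`, `use_render`), the 12-dimensional pose tokens, the summation, and the residual connection are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(dim, pose_dim, render_dim, seed=0):
    """Random projection matrices (hypothetical shapes and names)."""
    rng = np.random.default_rng(seed)

    def w(n_in, n_out):
        return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

    return {
        "W_q": w(dim, dim),  # the one query projection both branches share
        "W_k_pose": w(pose_dim, dim),
        "W_v_pose": w(pose_dim, dim),
        "W_k_render": w(render_dim, dim),
        "W_v_render": w(render_dim, dim),
        "W_o": w(dim, dim),
    }


def query_shared_cross_attention(x, pose_tokens, render_tokens, params,
                                 use_render=True):
    """Attend from video tokens x to pose and rendered-video tokens.

    x:             (N, dim) video latent tokens
    pose_tokens:   (F, pose_dim) per-frame camera tokens (assumed layout)
    render_tokens: (M, render_dim) rendered-video tokens
    use_render:    False mimics a pose-only first stage (an assumption)
    """
    q = x @ params["W_q"]  # computed once, shared by both branches
    out = attend(q,
                 pose_tokens @ params["W_k_pose"],
                 pose_tokens @ params["W_v_pose"])
    if use_render:
        out = out + attend(q,
                           render_tokens @ params["W_k_render"],
                           render_tokens @ params["W_v_render"])
    return x + out @ params["W_o"]  # residual connection (assumed)
```

In this sketch the `use_render` flag loosely mirrors the two-stage strategy: calling the function with `use_render=False` uses only pose cues, and enabling it adds the rendered-video branch on top of the same queries. Whether PostCam sums, concatenates, or gates the two branches is not stated in the text above.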

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Given a source video (left) and a target camera trajectory (middle), PostCam generates high-quality novel-view videos (right), enabling post-capture editing of camera motion in dynamic scenes. Despite using a lightweight backbone with only 1.3B parameters, PostCam achieves precise trajectory control and high-fidelity novel-view synthesis across a wide range of video styles and motion patterns.
  • Figure 2: Comparison with State-of-the-Art Methods in Real Dynamic Scenes. We compare the last-frame results of state-of-the-art (SOTA) methods and our approach, with the last frame of the source video shown in the first column for reference. Across all examples, our method consistently preserves fine details (e.g., faces, hands), while SOTA methods often suffer from blurring, distortion, and identity drift. Moreover, our approach remains robust even in challenging cases where SOTA methods fail completely (rows 4-5), producing high-quality and temporally coherent results. For more comparisons, please refer to our supplementary videos.
  • Figure 3: Framework overview. The model stacks multiple transformer blocks along three parallel pathways. (a) The source video is first encoded into latent space and then concatenated with noised latents before entering the transformer blocks. (b) The camera parameters are processed by a lightweight encoder and injected into every block via query-shared cross-attention. (c) The rendered video is similarly encoded into latent space; within each block, it undergoes self-attention to extract high-level visual features. An exploded view (bottom right) depicts the internal structure of a transformer block.
  • Figure 4: Qualitative comparison results. We compare our method against SOTA methods (columns 3-6) under specific camera motions (column 2, showing the motion direction and the rendered view). Source frames (initial, middle, final) are in column 1. Das, an I2V model, fails to generate coherent results. TrajectoryCrafter, which relies on rendering, fails when the rendered data is inaccurate (due to depth errors) or incomplete because of the source video's own motion (e.g., the missing head in scene 2's final frame). ReCamMaster, though render-agnostic, struggles to achieve both quality and motion precision in difficult scenes. Our method robustly generates high-quality results across all scenes, faithfully aligning with the source content and achieving superior pose accuracy.
  • Figure 5: Comparisons under different conditions. Our method achieves faithful preservation of fine details and more accurate background generation while ensuring accurate camera motion control. The red boxes highlight that, even when distortions occur in the rendered views, only our results retain fine details, such as the cigarette in the person's hand, while achieving superior pose accuracy.