CamDirector: Towards Long-Term Coherent Video Trajectory Editing

Zhihao Shi, Kejia Yin, Weilin Wan, Yuhongze Zhou, Yuanhao Yu, Xinxin Zuo, Qiang Sun, Juwei Lu

TL;DR

This paper introduces a new video trajectory editing (VTE) framework that explicitly aggregates information across the entire source video via a hybrid warping scheme and processes video segments jointly with their history via a history-guided autoregressive diffusion model, enabling long-term temporal coherence.

Abstract

Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework with two components. 1) A hybrid warping scheme explicitly aggregates information across the entire source video: static regions are progressively fused into a world cache and then rendered to the target camera poses, while dynamic regions are directly warped; fusing the two yields globally consistent coarse frames that guide refinement. 2) A history-guided autoregressive diffusion model processes video segments jointly with their history, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters than competing methods.
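The fusion step of the hybrid warping scheme can be illustrated with a minimal sketch. It is not the authors' implementation: the names `fuse_coarse_frame` and `hole_mask` are hypothetical, frames are reduced to small grids of grayscale values, and the compositing rule (a warped dynamic pixel takes precedence over the rendered static pixel) is an assumption made only for illustration. Pixels that neither branch covers stay empty and correspond to the regions the diffusion model would have to inpaint.

```python
from typing import List, Optional

# A pixel is a grayscale value, or None where no source information exists.
Pixel = Optional[float]
Frame = List[List[Pixel]]


def fuse_coarse_frame(static_render: Frame, dynamic_warp: Frame) -> Frame:
    """Composite one coarse frame from the two warping branches.

    static_render: static regions rendered from the world cache (assumed input).
    dynamic_warp:  dynamic regions warped directly from the source frame.
    Assumption: a dynamic pixel, when present, overrides the static pixel.
    """
    fused: Frame = []
    for static_row, dynamic_row in zip(static_render, dynamic_warp):
        fused.append(
            [d if d is not None else s for s, d in zip(static_row, dynamic_row)]
        )
    return fused


def hole_mask(frame: Frame) -> List[List[bool]]:
    """Mark pixels covered by neither branch; these remain to be inpainted."""
    return [[pixel is None for pixel in row] for row in frame]


# Toy 2x3 example: the static branch covers the left side, the dynamic
# branch covers one pixel, and the right column is seen by neither.
static_render: Frame = [[0.1, 0.2, None], [0.3, None, None]]
dynamic_warp: Frame = [[None, 0.9, None], [None, None, None]]
coarse = fuse_coarse_frame(static_render, dynamic_warp)
holes = hole_mask(coarse)
```

In this toy case the fused frame keeps the static values on the left, takes the dynamic value `0.9` where both branches overlap, and leaves three pixels flagged in `holes`.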

Paper Structure (17 sections, 2 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: We present a novel video trajectory editing framework that generates new videos along desired camera trajectories from given source videos, achieving aesthetically pleasing, cinematic camera movements. The proposed method performs favorably against state-of-the-art methods with notably fewer parameters ($\bullet$ and $\blacktriangle$ distinguish different benchmarks). Additional results and videos are available on our project page (https://yinkejia.github.io/CamDirector-Project-Page).
  • Figure 2: Visual comparison between per-frame warping (a), our hybrid warping (b), and ground truth (c). Hybrid warping tends to produce more complete and source-aligned coarse frames.
  • Figure 3: Overview of our framework. Left: The hybrid warping scheme leverages the entire source video to construct coarse frames by processing dynamic and static regions separately, providing a global reference of the original scene content. Right: The CCDM conditions the generation on the coarse video via ControlNet, while source-frame tokens are concatenated with target tokens as inputs to the base text-to-video (T2V) model to provide reliable motion and appearance priors.
  • Figure 4: Illustration of history-guided autoregressive generation. In each iteration, $T^{\star}$ previously generated frames serve as history to guide the synthesis of the next $T$ frames. The corresponding $T^{\star}+T$ source frames are also provided as input, both to produce the coarse frames and to supply the original scene context.
  • Figure 5: Illustration of progressive world cache update. Whenever a new segment is generated, we evenly sample $C$ frames from it as anchors and merge their newly inpainted regions into the world cache. The updated regions are highlighted in red.
  • ...and 7 more figures
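
The frame bookkeeping described in the captions of Figures 4 and 5 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the paper's code: `segment_schedule` and `sample_anchors` are hypothetical names, `seg_len`, `hist_len`, and `num_anchors` stand in for $T$, $T^{\star}$, and $C$, the first segment is assumed to start with no history, and "evenly sample" is interpreted as evenly spaced indices that include both ends of the segment.

```python
from typing import Iterator, List, Tuple


def segment_schedule(
    num_frames: int, seg_len: int, hist_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (history_indices, new_indices) for each autoregressive iteration.

    seg_len plays the role of T, hist_len the role of T*.
    Assumption: the first iteration has an empty history.
    """
    start = 0
    while start < num_frames:
        end = min(start + seg_len, num_frames)
        hist_start = max(0, start - hist_len)
        # The history frames were generated earlier; together with the new
        # indices they also select the T* + T source frames for this step.
        yield list(range(hist_start, start)), list(range(start, end))
        start = end


def sample_anchors(segment: List[int], num_anchors: int) -> List[int]:
    """Pick num_anchors (C) evenly spaced frame indices from one segment."""
    n = len(segment)
    if num_anchors <= 0 or n == 0:
        return []
    if num_anchors >= n:
        return list(segment)
    if num_anchors == 1:
        return [segment[n // 2]]
    # Integer arithmetic keeps the spacing deterministic.
    return [segment[(i * (n - 1)) // (num_anchors - 1)] for i in range(num_anchors)]


# Toy run: 10 frames, T = 4, T* = 2, C = 2.
schedule = list(segment_schedule(num_frames=10, seg_len=4, hist_len=2))
anchors = [sample_anchors(new, num_anchors=2) for _, new in schedule]
```

With these toy values the schedule has three iterations, the second of which uses frames 2 and 3 as history to generate frames 4 through 7, and the anchors of each segment are its first and last frame. In a full pipeline, the inpainted regions of those anchor frames would then be merged into the world cache before the next iteration.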