ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, Jong Chul Ye

TL;DR

ReDirector tackles the challenge of generating realistic video retakes for variable-length inputs under dynamic camera motion. It rectifies prior RoPE misuses by applying a shared 3D RoPE to both input and target videos and introduces Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that encodes multi-view geometry within attention. The method couples RoCE with geometry-aware attention, yielding improved geometric consistency, accurate dynamic object localization, and better background preservation across long sequences and out-of-distribution trajectories. Evaluations on the DAVIS and iPhone datasets and on ReCamMaster trajectories show robust camera controllability and superior visual quality compared to warping-based and implicit baselines, with strong generalization to unseen lengths and resolutions. The work advances practical camera-controlled video retake generation by enabling flexible, high-fidelity retakes without reliance on external depth or geometry estimators, using token-level multi-view reasoning within a diffusion-Transformer framework.

Abstract

We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.
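The idea of a RoPE phase shift can be illustrated with a minimal sketch. It assumes a 1D position index and a scalar per-token phase, both simplifications introduced here for illustration; how RoCE derives its phase from camera parameters is not specified in this section, and the helper names `rotate` and `score` are hypothetical. Feature pairs are modelled as complex numbers, so a rotation is a multiplication by a unit complex number.

```python
import cmath

def rotate(vec, position, phase_shift=0.0, base=10000.0):
    """Rotate each complex feature pair by (frequency * position + phase_shift)."""
    n = len(vec)
    out = []
    for i, z in enumerate(vec):
        freq = base ** (-i / n)  # standard RoPE frequency schedule
        out.append(z * cmath.exp(1j * (freq * position + phase_shift)))
    return out

def score(q, k):
    """Dot product of the underlying real pairs (real part of q * conj(k))."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.1j, -1 + 0.7j]

# Both pairs have relative position 3 and relative phase 0.3 (assumed values).
s1 = score(rotate(q, 5, 0.4), rotate(k, 2, 0.1))
s2 = score(rotate(q, 13, 1.0), rotate(k, 10, 0.7))

# Same relative position, but the relative phase is now 0.
s3 = score(rotate(q, 5, 0.4), rotate(k, 2, 0.4))

print(abs(s1 - s2) < 1e-9)  # score depends only on relative position and phase
print(abs(s1 - s3) > 1e-3)  # changing the relative phase changes the score
```

In this sketch, an input-video token and a target-video token that share the same spatiotemporal position have zero relative position, so only the phase term distinguishes them in the attention score. That is one way to read the combination of aligned positions and a camera-conditioned phase described in the abstract.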

Paper Structure

This paper contains 36 sections, 16 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: ReDirector for video retake generation. Given any-length input videos (①), ReDirector generates realistic retakes (②) along the target camera trajectories even with dynamic camera motion in the input video. ReDirector is capable of accurately localizing dynamic objects while preserving static backgrounds throughout the sequence, leading to multi-view consistent retakes spanning hundreds of frames.
  • Figure 2: Overview of ReDirector. (a) ReDirector is fine-tuned on Wan-I2V-CamCtrl [wan2025wan], which incorporates camera control signals into image-to-video generation. Our goal is to reconstruct a video retake $\mathbf{V'}_t$ conditioned on target camera trajectories $\mathbf{P}_t$, input video $\mathbf{V}_s$, and its poses $\mathbf{P}_s$. (b) Following ReCamMaster [Bai_2025_ICCV], we train only self-attention layers while keeping the remaining modules frozen. (c) We insert RoCE into self-attention layers, whose outputs are used as camera-conditioned RoPE phase shifts. First, $\bm{\phi}_{\mathbf{qv}}$ is applied to queries and keys, providing physically grounded rotary position encoding. Second, $\bm{\phi}_{\mathbf{vo}}$ modulates the value path by applying $-\bm{\phi}_{\mathbf{vo}}$ before attention weighting and $+\bm{\phi}_{\mathbf{vo}}$ after value aggregation, enabling geometry-aware attention. For clarity, text prompts in cross-attention are omitted.
  • Figure 3: Attention of RoCE. We visualize the attention from the colored dot in the leftmost column to the first frame (left), and across five uniformly sampled frames (right). Within each frame, attention varies with pixel coordinates, whereas differences in relative pose have a more pronounced impact on the attention scores.
  • Figure 4: Qualitative results on the DAVIS dataset [pont20172017]. ReDirector generates realistic video retakes (②) from dynamically captured input video (①), achieving better camera control, dynamic object localization, and background preservation.
  • Figure 5: Qualitative ablation of conditioning type on DAVIS [pont20172017]. Compared to simple additive conditioning, RoCE enhances geometric consistency, while the additional geometry-aware attention further boosts retake quality and achieves accurate dynamic object localization.
  • ...and 6 more figures
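
The value-path modulation described in the Figure 2(c) caption can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes one scalar phase per token for each of the two paths, models feature pairs as complex numbers, and omits the usual score scaling and multi-head structure. The function name `geometry_aware_attention` and all input values are hypothetical.

```python
import cmath
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def phase(vec, phi):
    """Rotate every complex feature pair of a token by the angle phi."""
    return [z * cmath.exp(1j * phi) for z in vec]

def score(q, k):
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

def geometry_aware_attention(queries, keys, values, phi_qv, phi_vo):
    # Query/key path: the same per-token phase is applied to queries and keys.
    keys_r = [phase(k, p) for k, p in zip(keys, phi_qv)]
    # Value path: apply -phi_vo to each value before attention weighting.
    vals_r = [phase(v, -p) for v, p in zip(values, phi_vo)]
    dim = len(values[0])
    outputs = []
    for q, p_q, p_o in zip(queries, phi_qv, phi_vo):
        weights = softmax([score(phase(q, p_q), k) for k in keys_r])
        agg = [sum(w * v[d] for w, v in zip(weights, vals_r)) for d in range(dim)]
        # Apply +phi_vo of the querying token after value aggregation.
        outputs.append(phase(agg, p_o))
    return outputs

tokens = [[1 + 1j, 0.2 - 0.5j], [-0.4 + 0.3j, 1j], [0.7 + 0j, -1 - 1j]]
out_a = geometry_aware_attention(tokens, tokens, tokens,
                                 [0.0, 0.3, 0.9], [0.1, 0.5, -0.2])
# Shift every phase by a constant: relative phases are unchanged.
out_b = geometry_aware_attention(tokens, tokens, tokens,
                                 [0.7, 1.0, 1.6], [1.4, 1.8, 1.1])
same = all(abs(a - b) < 1e-9
           for ra, rb in zip(out_a, out_b) for a, b in zip(ra, rb))
print(same)
```

Under these assumptions, each value is effectively rotated by the phase difference between the querying token and the attended token. The output therefore depends only on relative phases, which is why shifting all phases by a constant leaves the result unchanged.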