ReRoPE: Repurposing RoPE for Relative Camera Control

Chunyang Li, Yuanbo Yang, Jiahao Shao, Hongyu Zhou, Katja Schwarz, Yiyi Liao

TL;DR

ReRoPE tackles the lack of shift-invariance and generalization in camera-controllable video generation by repurposing underutilized low-frequency components of Rotary Positional Embeddings (RoPE) to encode relative camera geometry. The method keeps the pre-trained backbone intact and injects relative camera information through a lightweight projection block in the temporal RoPE channels, enabling precise V2V and I2V control with minimal training cost. Key contributions include identifying low-frequency redundancy in RoPE, proposing a simple, plug-and-play conditioning mechanism, and demonstrating superior camera accuracy and 3D consistency without sacrificing visual fidelity across diverse datasets. This approach provides a practical pathway to controllable, high-fidelity video generation by leveraging existing generative priors and avoiding architectural overhauls.

Abstract

Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: https://sisyphe-lee.github.io/ReRoPE/
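The contrast between fixed-reference and pairwise relative poses can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: poses are taken to be 4x4 camera-to-world matrices, the trajectory is made up, and the helpers `pose`, `inverse`, `relative`, and `close` are hypothetical names. It shows that the encoding of a frame relative to "the first frame" changes when the clip starts elsewhere, while the relative pose between a given pair of views depends only on that pair and survives a global change of world coordinates.

```python
import math

def pose(yaw, tx, ty, tz):
    """4x4 rigid transform: rotation about the y axis plus a translation.
    Treated here as a camera-to-world matrix (an assumption of this sketch)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, 0.0, s, tx],
            [0.0, 1.0, 0.0, ty],
            [-s, 0.0, c, tz],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def inverse(T):
    """Inverse of a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][k] * T[k][3] for k in range(3)) for i in range(3)]
    return [Rt[0] + [t[0]], Rt[1] + [t[1]], Rt[2] + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def relative(T_a, T_b):
    """Pose of camera b expressed in the coordinate frame of camera a."""
    return matmul(inverse(T_a), T_b)

def close(a, b, tol=1e-9):
    """Element-wise comparison of two 4x4 matrices."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical six-frame trajectory: the camera turns while moving.
trajectory = [pose(0.1 * i, 0.5 * i, 0.0, 0.2 * i * i) for i in range(6)]

# Fixed-reference encoding of frame 4: it depends on which frame starts the clip.
ref_from_0 = relative(trajectory[0], trajectory[4])
ref_from_2 = relative(trajectory[2], trajectory[4])

# Pairwise relative pose between frames 3 and 4: it involves no reference frame,
# and it is unchanged when every pose is moved by the same global transform G.
G = pose(1.3, -2.0, 0.7, 4.0)
moved = [matmul(G, T) for T in trajectory]
pair_before = relative(trajectory[3], trajectory[4])
pair_after = relative(moved[3], moved[4])

print("fixed-reference encoding changes with clip start:",
      not close(ref_from_0, ref_from_2))
print("pairwise relative pose is unchanged:", close(pair_before, pair_after))
```

The pairwise quantity is the kind of signal that attention can consume per query-key pair, which is why it fits naturally into a positional-encoding mechanism such as RoPE.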

Paper Structure (31 sections, 10 equations, 10 figures, 4 tables)

This paper contains 31 sections, 10 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Toy Case Analysis of RoPE. The heatmap shows attention scores of unit key and query vectors across different frequencies and index distances, revealing that low-frequency bands exhibit negligible phase shifts ($\approx 0$) across the window.
  • Figure 2: Frequency analysis across three video diffusion models [kong2024hunyuanvideo, cogvideo, wan2021]. We observe that masking high-frequency bands (Left) leads to model collapse, whereas masking low-frequency bands (Middle) maintains generation quality comparable to the baseline (Right), confirming the functional redundancy of low-frequency components.
  • Figure 3: Overview of ReRoPE framework. We enable relative camera control by repurposing the redundant low-frequency temporal bands of a pre-trained Video DiT. (Left and Middle) As shown in the matrix decomposition, we retain spatial and high-frequency temporal bands to preserve generative priors, replacing only the underutilized low-frequency temporal bands with our camera projection block. (Right) This plug-and-play mechanism supports both Video-to-Video and Image-to-Video tasks.
  • Figure 4: Qualitative comparison of V2V generation. Trajectory visualization shows that baselines (red) deviate from the ground truth (green). In contrast, ReRoPE (blue) maintains tight alignment, demonstrating the precise geometric control required for superior visual fidelity and consistency.
  • Figure 5: Qualitative comparison on dynamic I2V scenes showing that ReRoPE synthesizes natural object motion, whereas baselines incorrectly render the subject as a static rigid body.
  • ...and 5 more figures
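
The toy analysis described in the Figure 1 caption can be approximated with a few lines of code. This is a minimal sketch, not the authors' experiment: it assumes the standard RoPE frequency schedule with base 10000, a head dimension of 128, and a window of 32 positions, and the names `rope_frequencies` and `pair_score` are hypothetical. For a unit query and an identical unit key in one 2-D channel pair, the score at a given index distance is the cosine of the accumulated phase, so a band whose phase barely moves across the window contributes an almost constant score.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, one per 2-D channel pair,
    ordered from highest (index 0) to lowest (last index)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def pair_score(theta, distance):
    """Dot product of a unit 2-D query with an identical unit key that has
    been rotated by the RoPE phase theta * distance."""
    q = (1.0, 0.0)
    angle = theta * distance
    k = (math.cos(angle), math.sin(angle))
    return q[0] * k[0] + q[1] * k[1]

HEAD_DIM = 128  # assumed head dimension
WINDOW = 32     # assumed number of positions in the window

freqs = rope_frequencies(HEAD_DIM)

# scores[i][d]: score of channel pair i at index distance d (a heatmap as rows).
scores = [[pair_score(theta, d) for d in range(WINDOW)] for theta in freqs]

highest_band = scores[0]   # fastest rotation
lowest_band = scores[-1]   # slowest rotation

print("max phase of the lowest band over the window:", freqs[-1] * (WINDOW - 1))
print("score range, highest band:", min(highest_band), max(highest_band))
print("score range, lowest band:", min(lowest_band), max(lowest_band))
```

Under these assumptions the highest band sweeps through several full rotations inside the window, while the lowest band stays within a fraction of a degree of its starting phase, which matches the caption's observation that low-frequency bands show phase shifts close to zero.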