Geometry-Aware Rotary Position Embedding for Consistent Video World Model

Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu

TL;DR

This work tackles long-term geometric drift in camera-conditioned world models by identifying screen-space positional embeddings as a key bottleneck. It introduces ViewRope, a geometry-aware rotary position encoding that injects per-patch viewing-ray directions into attention, enabling 3D-consistent content retrieval across long histories. To scale to long sequences, Geometry-Aware Frame-Sparse Attention attends only to geometrically relevant past frames, reducing compute while preserving loop-closure fidelity. The ViewBench diagnostic suite quantifies revisit fidelity and geometric drift, and experiments show that ViewRope achieves better long-term consistency at lower cost than prior geometry-aware or memory-based approaches. Together, geometry-grounded attention and frame-sparse retrieval offer a practical path to reliable, controllable video world models for interactive AI applications.

Abstract

Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware Frame-Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
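
The idea of parameterizing attention by relative ray geometry can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a pinhole camera, pure camera rotation, a single 3-dimensional query/key subvector, and one particular way of turning a ray into a rotation (the minimal rotation taking the +z axis onto the ray). The helper names `pixel_ray`, `rotation_from_ray`, and `ray_rotary_score` are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def yaw(theta):
    """Camera-to-world rotation for a yaw of theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def pixel_ray(u, v, fx, fy, cx, cy, R):
    """Unit world-space viewing ray through pixel (u, v) of a pinhole camera."""
    d = matvec(R, [(u - cx) / fx, (v - cy) / fy, 1.0])
    n = math.sqrt(sum(x * x for x in d))
    return [x / n for x in d]

def rotation_from_ray(d):
    """Minimal rotation taking +z onto the unit ray d (Rodrigues' formula).
    Assumption: this is one possible ray-to-rotation mapping, chosen for brevity."""
    ax, ay = -d[1], d[0]                 # z x d; the third component is zero
    s, c = math.hypot(ax, ay), d[2]      # sine and cosine of the rotation angle
    if s < 1e-12:                        # ray (anti)parallel to the z axis
        sign = 1.0 if c > 0 else -1.0
        return [[1.0, 0.0, 0.0], [0.0, sign, 0.0], [0.0, 0.0, sign]]
    kx, ky = ax / s, ay / s              # unit rotation axis (kx, ky, 0)
    K = [[0.0, 0.0, ky], [0.0, 0.0, -kx], [-ky, kx, 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + (1.0 - c) * K2[i][j]
             for j in range(3)] for i in range(3)]

def ray_rotary_score(q, k, d_q, d_k):
    """Attention logit after rotating q and k by their ray rotations.
    Equals q^T R_q^T R_k k, so it depends only on the relative rotation."""
    q_rot = matvec(rotation_from_ray(d_q), q)
    k_rot = matvec(rotation_from_ray(d_k), k)
    return sum(a * b for a, b in zip(q_rot, k_rot))
```

In this sketch, two patches from different frames that look along the same world-space ray receive identical rotations, so their logit reduces to the plain dot product of `q` and `k` regardless of where the patches sit on screen. Patches whose rays differ are compared through the relative rotation between the rays.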

Paper Structure (49 sections, 16 equations, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: Camera-controlled video generation with ViewRope. Top: generated video following camera trajectories with loop closure (rotate-away-rotate-back). Bottom: generated videos of high-motion Minecraft gameplay. ViewRope maintains consistent scene appearance when the camera revisits previously observed viewpoints.
  • Figure 2: Method overview. (a) ViewRope computes per-patch viewing rays from intrinsics, constructs local rotations, and rotates query/key feature subvectors in attention. The resulting dot product encodes relative angular relationships between viewing rays. (b) Geometry-Aware Frame-Sparse Attention estimates block (frame) relevance and selects top-$k$ geometrically relevant historical frames, replacing quadratic dense attention with geometry-driven sparsity.
  • Figure 3: Visualization of attention specialization. Left: A standard temporal head focuses on recent or temporally periodic frames. Middle: A geometry-aware head captures long-range spatial overlap (evident in the antidiagonal activation during loop closure). Right: The aggregated attention map illustrates how geometric cues guide sparse block selection.
  • Figure 4: Case study. Upper and lower sequences show ViewRope with sliding-window and sparse attention, respectively.
  • Figure 5: Case 1: Yaw + Pitch loop closure in an urban street. M-G 2.0 suffers from brightness collapse. HY-WorldPlay exhibits geometric drift. ViewRope maintains structural and lighting consistency.
  • ...and 2 more figures
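
The top-$k$ frame selection described in the Figure 2(b) caption can be sketched as follows. The relevance score here is a stand-in assumption (cosine similarity between the optical axes of the current and past cameras); the captions do not specify how the paper estimates frame relevance. The names `forward_axis` and `select_frames` are hypothetical.

```python
import math

def forward_axis(yaw_rad, pitch_rad=0.0):
    """Unit optical-axis direction of a camera with the given yaw and pitch."""
    cp = math.cos(pitch_rad)
    return [cp * math.sin(yaw_rad), math.sin(pitch_rad), cp * math.cos(yaw_rad)]

def select_frames(current_axis, history_axes, k):
    """Indices of the k past frames whose view direction best matches the current one.
    Assumption: relevance is the cosine between optical axes, used as a simple proxy."""
    scores = [sum(a * b for a, b in zip(current_axis, axis))
              for axis in history_axes]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])            # return in temporal order

# Rotate-away-rotate-back trajectory: yaw in degrees for six past frames.
history = [forward_axis(math.radians(y)) for y in [0, 30, 60, 90, 60, 30]]
current = forward_axis(0.0)              # the camera has returned to yaw 0
selected = select_frames(current, history, k=3)
print(selected)                          # [0, 1, 5]
```

In this toy trajectory, a sliding window of three frames would keep frames 3, 4, and 5, missing frame 0, which shares the current viewpoint. The geometry-driven selection keeps frame 0 while still attending to only three of the six past frames.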