Unified Camera Positional Encoding for Controlled Video Generation

Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai

TL;DR

This work introduces Unified Camera Positional Encoding (UCPE), a geometry-consistent framework that jointly encodes 6-DoF poses, intrinsics, and lens distortions into Transformer-based video generation. UCPE combines Relative Ray Encoding for ray-space conditioning with an Absolute Orientation cue (Lat-Up map) to enable explicit control over initial camera orientation, and it integrates into pretrained diffusion transformers via a lightweight spatial attention adapter, adding less than 1% trainable parameters. A large camera-diverse dataset is built to train and evaluate the method across pinhole, wide-angle, fisheye, and panoramic projections, demonstrating improved lens control, pose accuracy, and visual fidelity compared with state-of-the-art baselines. The results suggest UCPE as a general, geometry-aware encoding for Transformers applicable to future multi-view, video, and 3D tasks.

Abstract

Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at https://github.com/chengzhag/UCPE.
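
The adapter is described as lightweight, and the Figure 3 caption notes that its camera-aware tokens are fused back through a zero-initialized linear layer. The sketch below illustrates, in plain Python, why zero initialization leaves the pretrained output unchanged at the start of training. It is not the authors' implementation: the helper names (`linear`, `fuse`), the toy dimensions, and the additive residual form of the fusion are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code): a residual branch whose
# output projection starts at zero, so the pretrained features pass through
# unchanged until training moves the weights away from zero.

def linear(x, weight, bias):
    """Apply y = W x + b for a single token given as a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def fuse(hidden, camera_tokens, weight, bias):
    """Add the projected camera-aware features to the pretrained features."""
    delta = linear(camera_tokens, weight, bias)
    return [h + d for h, d in zip(hidden, delta)]

# Hypothetical toy sizes: 3-dim pretrained token, 2-dim camera-aware token.
hidden = [0.5, -1.0, 2.0]
camera_tokens = [3.0, 4.0]

# Zero-initialized projection: every weight and bias starts at 0.
weight = [[0.0, 0.0] for _ in range(3)]
bias = [0.0, 0.0, 0.0]

at_init = fuse(hidden, camera_tokens, weight, bias)

# Simulate one training update that nudges a single weight.
weight[0][0] = 0.1
after_update = fuse(hidden, camera_tokens, weight, bias)
```

At initialization `at_init` equals `hidden`, so the pretrained model's behavior is reproduced exactly; only after the projection weights are updated does the camera branch start to influence the output.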

Paper Structure

This paper contains 45 sections, 32 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Overview of our camera-controllable video generation. Given a user-specified text prompt and camera parameters, including horizontal field of view, distortion $\xi$, and camera poses with optional absolute orientation (encoded as a latitude-up map), our model synthesizes realistic videos consistent with diverse camera geometries, demonstrating accurate pose and lens controllability and high visual fidelity. Applications span generative video content creation and world models for autonomous driving and embodied AI.
  • Figure 2: Comparison of camera encoding methods. (a) Direct Parameterization encodes camera intrinsics and extrinsics as raw parameters, which lacks geometric interpretability and compatibility across camera types. (b) Plücker Encoding represents each ray as a pair of direction and moment vectors, providing a physically grounded but absolute, coordinate-dependent description. (c) Projective Positional Encoding encodes relative cameras in projective space, yet assumes pinhole projection and cannot model non-linear lens distortions. (d) Our Relative Ray Encoding reformulates geometric relationships in ray space, where each token corresponds to its own viewing ray, enabling better pose generalization and compatibility with arbitrary camera lenses.
  • Figure 3: Overview of the Spatial Attention Adapter. The adapter injects UCPE into pretrained Transformers through a lightweight branch that preserves pretrained priors. It constructs a hybrid encoding from the world-to-ray transform $\mathbf{T}^{\textrm{rw}}$ and an optional Lat-Up map, applies this encoding within attention, and fuses the resulting camera-aware tokens back through a zero-initialized linear layer.
  • Figure 4: Comparison on our synthesized dataset. UCPE faithfully follows target trajectories and produces consistent lens distortions aligned with the visualization of the Lat-Up map. In contrast, Wan CameraCtrl shows camera motion deviations, while ReCamMaster fails to reproduce the intended distortion. Colors correspond to the highlighted effects in the figure.
  • Figure 5: Comparison on the RealEstate10K dataset. UCPE generates sharper, more detailed frames that better follow the target camera motion. CameraCtrl produces severe artifacts (left) and poor composition (right), while AC3D preserves the training dataset's aesthetic but shows unbalanced framing (left) and low dynamic range (right). Wan CameraCtrl and ReCamMaster, though based on the same backbone, struggle with camera consistency, leading to reduced motion (left) and undesired distortion artifacts (right) under the pinhole setup.
  • ...and 3 more figures
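
The Figure 2(b) caption characterizes Plücker Encoding as a pair of direction and moment vectors that is physically grounded but coordinate-dependent. The sketch below uses the standard Plücker construction (moment = origin × direction) to make that concrete. It is a generic illustration, not code from the paper, and the function names are hypothetical.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("ray direction must be non-zero")
    return tuple(c / n for c in v)

def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (d, m) with m = origin x d.

    `origin` is any point on the ray, expressed in world coordinates.
    """
    d = normalize(direction)
    m = cross(origin, d)
    return d + m

# The same viewing direction seen from two camera centers in the world frame.
ray_a = plucker_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
ray_b = plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 1.0))

# Sliding the origin along the ray leaves the encoding unchanged.
ray_b_shifted = plucker_ray((1.0, 0.0, 5.0), (0.0, 0.0, 2.0))
```

Here `ray_a` and `ray_b` share a direction but differ in their moment components, because the moment depends on where the ray sits relative to the world origin. This is the absolute, coordinate-dependent behavior that the caption contrasts with a relative formulation.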