UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, Song-Hai Zhang

TL;DR

UCM is a framework that unifies long-term memory and precise camera control through a time-aware positional encoding warping mechanism. It pairs this mechanism with an efficient dual-stream diffusion transformer that reduces computational overhead while keeping generation high-fidelity.

Abstract

World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.

Paper Structure (13 sections, 5 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: An overview of our proposed UCM. Given previously generated frames and a specific camera trajectory as input, UCM encodes the historical frames into clean tokens that condition the denoising of noisy tokens. For camera control and memory injection, the framework uses time-aware positional encoding warping to establish spatio-temporal correspondence, and an efficient dual-stream transformer architecture to process both token streams. After iterative denoising, UCM yields a high-fidelity, scene-consistent video that adheres to the user-specified trajectory.
  • Figure 2: The architecture of the UCM DiT block. Each noisy token attends to all other noisy tokens and is guided by clean tokens via time-aware warped PEs, implemented through KV concatenation. Each clean token attends only to other clean tokens within the same frame, using the original PEs. The resulting block-sparse attention mask (shown here with $k_j=j$ for visualization) enables camera control and memory guidance at reduced computational cost.
  • Figure 3: Simulated revisiting from different viewpoints. We apply point cloud rendering with randomly perturbed viewpoints to simulate revisiting of the same scene for monocular videos.
  • Figure 4: Visual comparison of camera controllability. We highlight imprecise camera-controlled frame generation with red boxes.
  • Figure 5: Visual comparison of long-term memory under two evaluation settings. Red boxes highlight obvious failure cases of camera-controlled generation or inconsistent scene generation.
  • ...and 2 more figures
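
The block-sparse attention pattern described in the Figure 2 caption can be illustrated with a small mask-building sketch. This is not the authors' implementation. It assumes a hypothetical token layout (all noisy tokens first, then all clean tokens), an equal number of noisy and clean frames, and it reads $k_j$ as the index of the clean frame that guides noisy frame $j$, which the caption does not state explicitly. The function name `build_block_sparse_mask` and its parameters are illustrative only, and the warped positional encodings themselves are not modeled.

```python
import numpy as np


def build_block_sparse_mask(num_frames, tokens_per_frame, k=None):
    """Return a boolean mask where mask[q, kv] is True if query q may attend to key kv.

    Assumed layout (hypothetical): indices [0, F*T) are noisy tokens and
    indices [F*T, 2*F*T) are clean tokens, both grouped frame by frame.
    k[j] is assumed to be the clean frame that guides noisy frame j;
    the default k[j] = j mirrors the visualization setting in the caption.
    """
    F, T = num_frames, tokens_per_frame
    if k is None:
        k = list(range(F))
    n = F * T
    mask = np.zeros((2 * n, 2 * n), dtype=bool)

    # Noisy stream: every noisy token attends to all noisy tokens.
    mask[:n, :n] = True

    for j in range(F):
        noisy_rows = slice(j * T, (j + 1) * T)

        # Noisy frame j also attends to the clean tokens of frame k[j]
        # (the KV concatenation path in the caption).
        guide_cols = slice(n + k[j] * T, n + (k[j] + 1) * T)
        mask[noisy_rows, guide_cols] = True

        # Clean stream: clean frame j attends only to itself.
        clean_block = slice(n + j * T, n + (j + 1) * T)
        mask[clean_block, clean_block] = True

    return mask


mask = build_block_sparse_mask(num_frames=3, tokens_per_frame=2)
allowed = int(mask.sum())   # query-key pairs kept by the mask
dense = mask.size           # pairs in full attention over both streams
```

Under these assumptions, the mask for 3 frames of 2 tokens each keeps 60 of the 144 query-key pairs that dense attention over both streams would compute. Clean tokens never attend to noisy tokens, so they are unaffected by the noisy stream, and the fraction of kept pairs falls further as the number of frames grows.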