WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, Chunchao Guo

TL;DR

This paper proposes WorldStereo, a framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules, and shows that it acts as a powerful world model, tackling diverse scene generation tasks with high-fidelity 3D results.

Abstract

Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
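
To make the idea of "incrementally updated point clouds" concrete, the following is a minimal sketch of a point-cloud cache that grows as new views arrive and can be projected into a target camera. It is not the authors' implementation: the pinhole camera model, the names `PointCloudMemory`, `unproject_depth`, and `project_points`, and the 4x4 pose convention are all assumptions made for illustration, and a real renderer would also need splatting and occlusion handling.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H*W, 3).

    Assumes a pinhole camera with intrinsics K (3, 3) and a 4x4
    camera-to-world pose; both conventions are illustrative assumptions.
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pixels @ np.linalg.inv(K).T           # camera-space rays with z = 1
    cam_points = rays * depth.reshape(-1, 1)     # scale each ray by its depth
    homog = np.concatenate([cam_points, np.ones((cam_points.shape[0], 1))], axis=1)
    return (homog @ cam_to_world.T)[:, :3]

def project_points(points, K, world_to_cam):
    """Project world points (N, 3) into a target camera.

    Returns pixel coordinates (M, 2) and depths (M,) for the points that
    lie in front of the camera. No z-buffering or splatting is done here.
    """
    homog = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    cam = (homog @ world_to_cam.T)[:, :3]
    cam = cam[cam[:, 2] > 1e-6]                  # drop points behind the camera
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

class PointCloudMemory:
    """Hypothetical global point-cloud cache, grown one view at a time."""

    def __init__(self):
        self.points = np.zeros((0, 3))

    def update(self, depth, K, cam_to_world):
        new_points = unproject_depth(depth, K, cam_to_world)
        self.points = np.concatenate([self.points, new_points], axis=0)

    def render_coords(self, K, world_to_cam):
        return project_points(self.points, K, world_to_cam)
```

Under these assumptions, calling `update` once per generated view accumulates geometry, and `render_coords` gives the coarse structural guidance for a new camera pose along the target trajectory.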

Paper Structure (30 sections, 3 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: WorldStereo enables high-quality 3D scene generation based on single-view or panoramic inputs. The input reference views are framed in green. We present point clouds reconstructed from videos generated by WorldStereo: the top two perspective scenes use WorldMirror [liu2025worldmirror], while the bottom two panoramic scenes are aligned via monocular depth maps [wang2025moge].
  • Figure 2: Overview of WorldStereo. WorldStereo comprises two ControlNet branches. The camera branch handles precise camera control and the Global-Geometric Memory (GGM), relying on global point clouds; the Spatial-Stereo Memory (SSM) branch uses retrieved reference frames and pointmap (3D correspondence) guidance obtained from the 3D cache to further preserve fine-grained consistency. The diffusion noise part is omitted for simplicity.
  • Figure 3: Spatial-Stereo Memory (SSM). Reference views are retrieved from the memory bank, while pointmaps for both target and reference views are constructed based on the 3D cache. In SSM attention, we horizontally stitch each target-reference pair and rearrange the tensor shape to make each target frame's features focus on the specifically retrieved reference. B, F, H, W, C indicate dimensions of batch, frame, height, width, and channels.
  • Figure 4: Results on the 3D reconstruction benchmark. Column (a) shows the input views and ground-truth point clouds with four pre-defined trajectories (up, left, and right rotations, and orbit). For each method, we compare the reconstructed point clouds (left) and the generated novel views (right).
  • Figure 5: Ablation studies of memory components. The red-framed regions highlight consistency with the retrieved references. Baseline results are generated without any memory. GGM captures coarse structures but loses fine-grained details. Incorporating pointmaps significantly improves the consistency obtained from the reference frames retrieved from the memory bank.
  • ...and 3 more figures
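
The stitching and reshaping described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's code: the function names are hypothetical, the attention is a plain single-head softmax with identity projections, and the pointmap guidance is omitted. It shows only how folding the frame axis into the batch axis restricts each target frame to its own retrieved reference.

```python
import numpy as np

def stitch_target_reference(target, reference):
    """Stitch each target frame with its retrieved reference.

    target, reference: arrays of shape (B, F, H, W, C).
    Returns tokens of shape (B*F, H*2W, C), so that every
    target-reference pair forms its own attention sequence.
    """
    assert target.shape == reference.shape
    B, F, H, W, C = target.shape
    # Horizontal stitch along the width axis -> (B, F, H, 2W, C)
    stitched = np.concatenate([target, reference], axis=3)
    # Fold frames into the batch and flatten space -> (B*F, H*2W, C)
    return stitched.reshape(B * F, H * 2 * W, C)

def pairwise_self_attention(tokens):
    """Single-head softmax attention with identity projections (illustrative)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def ssm_attention_sketch(target, reference):
    """Attend within each stitched pair, then keep only the target half."""
    B, F, H, W, C = target.shape
    tokens = stitch_target_reference(target, reference)
    out = pairwise_self_attention(tokens).reshape(B, F, H, 2 * W, C)
    return out[:, :, :, :W, :]
```

Because the frame axis is merged into the batch axis before attention, changing the reference of one frame leaves the outputs of all other frames untouched, which is the per-frame focus the caption describes.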