World Reconstruction From Inconsistent Views

Lukas Höllein, Matthias Nießner

Abstract

Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high-quality, explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.
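
As a rough illustration of what lifting a frame into a pixel-wise 3D pointcloud means, the sketch below unprojects a depth map with a pinhole camera model. It is a minimal sketch under stated assumptions, not the geometric foundation model used in the paper: the function name `unproject_depth`, the intrinsics matrix `K`, and the camera-to-world pose convention are all hypothetical choices made for this example.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) z-depth map to an (H*W, 3) world-space pointcloud.

    Assumptions (not from the paper): K is a 3x3 pinhole intrinsics matrix
    and cam_to_world is a 4x4 rigid camera-to-world pose.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    pixels = pixels.reshape(-1, 3).astype(np.float64)
    # Camera-space rays with z = 1, then scaled by the per-pixel depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth.reshape(-1, 1)
    # Move the points into the shared world coordinate frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t
```

When the per-frame depth or pose estimates disagree across frames, the pointclouds produced this way do not line up in world space, which is the misalignment the abstract refers to.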

Paper Structure (32 sections, 13 equations, 17 figures, 2 tables)

Figures (17)

  • Figure 1: Our method reconstructs 3D worlds from video diffusion models. We propose a tailored non-rigid deformation of predicted pointcloud geometry (mid) that resolves the 3D inconsistencies inherent in generated video sequences. Then, we utilize this improved alignment to optimize a Gaussian Splatting [kerbl20233d] scene. Our worlds can be explored from novel views at high visual fidelity (right).
  • Figure 2: Method overview. We propose a three-stage method that reconstructs a 2DGS [huang20242d] scene from generated videos. First, we estimate multi-view depth and cameras with a geometric foundation model [lin2025depth]. The resulting dense scene initialization is unaligned (multiple non-overlapping surfaces) due to the inconsistent input frames. We propose a tailored non-rigid geometry alignment that leverages iterative frame-to-model ICP [izadi2011kinectfusion, besl1992method] and sparse correspondences, followed by global optimization, to create thin surfaces with detailed textures. Then, we leverage the alignment in a novel non-rigid-aware 2DGS [huang20242d] optimization to obtain high-quality, consistent 3D worlds.
  • Figure 3: Single video 3D reconstruction. We generate videos with HY-WorldPlay [worldplay2025] (top), Genie3 [genie3] (mid), and ViewCrafter [yu2024viewcrafter] (bottom) and reconstruct them in 3D. Our method optimizes consistent worlds from inconsistent generated frames. Compared to baselines, the renderings are of higher visual fidelity from both input and novel views.
  • Figure 4: Single video 3D reconstruction. We generate videos with SEVA [zhou2025stable] (top), Gen3C [ren2025gen3c] (mid), and Wan [wan2025wanopenadvancedlargescale] (bottom) and reconstruct these frames in 3D. For the baselines, inconsistencies in the generations lead to textures that are blurry compared to the corresponding video, and to severe floating artifacts from novel views. In contrast, our method creates 3D-consistent worlds with high fidelity beyond the generated perspectives.
  • Figure 5: Pointcloud reconstructions. We compare the quality of the reconstructed pointclouds that each method uses as initialization for Gaussian Splatting [kerbl20233d] optimization in the subsequent stages. Our approach achieves the highest alignment and compelling textures for individual objects with no overlap of multiple surfaces.
  • ...and 12 more figures
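
For orientation, the Figure 2 caption names iterative frame-to-model ICP. The sketch below shows only the classic rigid point-to-point ICP loop (nearest-neighbour matching followed by a least-squares rigid fit via SVD), not the paper's tailored non-rigid variant, its sparse correspondences, or its global optimization. The function names `kabsch` and `icp_rigid`, the brute-force neighbour search, and the fixed iteration count are assumptions made for this example.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst, both (N, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp_rigid(frame, model, iters=20):
    """Rigidly align frame points (N, 3) to model points (M, 3).

    Hypothetical helper: brute-force O(N*M) matching, fine for small inputs.
    """
    aligned = frame.copy()
    for _ in range(iters):
        # Match every frame point to its nearest model point.
        d2 = ((aligned[:, None, :] - model[None, :, :]) ** 2).sum(axis=-1)
        nearest = model[d2.argmin(axis=1)]
        # Fit and apply the rigid transform for the current matches.
        R, t = kabsch(aligned, nearest)
        aligned = aligned @ R.T + t
    return aligned
```

A single rigid transform per frame cannot remove surface-level disagreements between generated frames, which is why the caption describes a non-rigid alignment; a k-d tree would normally replace the brute-force search for pointclouds of realistic size.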