GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation

Yuhao Wan, Lijuan Liu, Jingzhi Zhou, Zihan Zhou, Xuying Zhang, Dongbo Zhang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

TL;DR

GeoWorld tackles geometric distortions in image-to-3D scene generation by leveraging geometry models to supply full-frame geometry features as conditions for a diffusion-based video generator. It introduces a two-stage geometrical condition generation process (rendering then completion), a geometry alignment loss to enforce real-world geometric constraints, and a geometry adaptation module to effectively fuse geometry signals into the video model. The approach, validated on RealEstate10K and Tanks and Temples, achieves superior fidelity and competitive perceptual quality in both novel-view synthesis and 3D scene reconstruction, with ablations confirming the value of its geometric components. Overall, GeoWorld demonstrates that geometry-aware conditioning can substantially improve high-fidelity 3D scene generation from a single image and a camera trajectory, offering a new direction for geometry-guided 3D generation research.

Abstract

Previous works that leverage video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the image-to-3D scene generation pipeline by unlocking the potential of geometry models, and present GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we propose to first generate consecutive video frames and then use a geometry model to extract full-frame geometry features, which contain richer information than the single-frame depth maps or camera embeddings used in previous methods. These geometry features serve as geometrical conditions that aid the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss that provides the model with real-world geometric constraints, and a geometry adaptation module that ensures the geometry features are used effectively. Extensive experiments show that GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: https://peaes.github.io/GeoWorld/.

Paper Structure

This paper contains 14 sections, 4 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Visual comparisons. Top: Comparison between our GeoWorld and previous methods. By incorporating geometry constraints, our approach achieves superior visual quality. Bottom left: Results before applying geometry constraints, which often suffer from geometric distortions and blurry content. Bottom right: Results after applying geometry constraints. By unlocking the potential of geometry models, our GeoWorld produces clear geometric structures and sharp visual details.
  • Figure 2: Pipeline comparison. (a) Pipelines of previous methods. Although details vary, their video models are conditioned only on single-frame information and limited geometric information. (b) Our GeoWorld leverages the geometrical condition generation procedure and a geometry model to obtain full-frame geometry features.
  • Figure 3: Overview of our GeoWorld. During training, we perform the geometrical condition generation procedure, feeding the obtained condition views into the geometry model to obtain full-frame geometry features, which are then processed by the geometry adaptation module. The condition views and the geometry features together serve as input to the geometry-constrained diffusion model to embed the geometric information. The predicted views produced by this model are then used to reconstruct the 3DGS scene.
  • Figure 4: Visual comparisons of the design of the geometry-constrained diffusion model. Base model refers to directly embedding the geometry features into the model via cross-attention. 'GAL': geometry alignment loss. 'GAM': geometry adaptation module.
  • Figure 5: Qualitative comparison of GeoWorld with state-of-the-art methods on novel view synthesis.
  • ...and 10 more figures