GEN3D: Generating Domain-Free 3D Scenes from a Single Image

Yuxin Zhang, Ziyu Lu, Hongbo Duan, Keyu Fan, Pengting Luo, Peiyu Zhuang, Mengyu Yang, Houde Liu

TL;DR

Gen3d tackles the challenge of producing high-quality 3D scenes from a single image by integrating depth-guided foreground/background segmentation with diffusion-based inpainting, then building a growing point cloud along a predefined camera path. This geometric prior is refined via a 3D Gaussian Splatting representation to enable efficient, view-consistent novel-view rendering. The method supports inputs from text, RGB, or RGBD and leverages Stable Diffusion, monocular depth estimation, and explicit 3D representations to achieve domain-free 3D scene generation. Experimental results on WorldScore and qualitative benchmarks show improved 3D consistency and photometric quality, indicating strong practical potential for embodied AI and world-model learning.

Abstract

Despite recent advancements in neural 3D reconstruction, the dependence of these methods on dense multi-view captures restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high-quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generating high-quality, wide-scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized by optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and synthesizing high-fidelity and consistent novel views.
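
The abstract's first step, lifting an RGBD image into an initial point cloud, can be illustrated with standard pinhole back-projection. The sketch below is not taken from the paper: the function name `lift_rgbd`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the convention that a non-positive depth marks an invalid pixel are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def lift_rgbd(
    rgb: Sequence[Sequence[Color]],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[List[Point], List[Color]]:
    """Back-project every pixel with valid depth into camera space.

    Hypothetical helper: assumes a pinhole camera and that depth <= 0
    means "no measurement" (an illustrative convention, not the paper's).
    """
    points: List[Point] = []
    colors: List[Color] = []
    for v, (rgb_row, depth_row) in enumerate(zip(rgb, depth)):
        for u, (color, z) in enumerate(zip(rgb_row, depth_row)):
            if z <= 0.0:
                continue  # skip pixels without a usable depth value
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
            colors.append(color)
    return points, colors
```

Each returned point keeps the color of its source pixel, so the resulting cloud can be rendered from other viewpoints or used to initialize an explicit representation. For text or RGB-only input, the depth map would first have to come from a monocular depth estimator, which this sketch leaves out.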

Paper Structure

This paper contains 10 sections, 5 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Gen3d pipeline. Initially, the input 2D image is segmented into two distinct components: the foreground objects and the background. We adopt methodologies such as the Stable Diffusion model and monocular depth estimation to enhance point cloud coverage and facilitate the construction of larger-scale scenes. Subsequently, we employ the point cloud alongside the reprojected images to optimize a set of Gaussian splats, further refining the resulting 3D scene.
  • Figure 2: Comparisons of multi-view generation results across different methods. The images are sourced from the COCO dataset, the WorldScore Benchmark, and web-sourced images.
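
The Figure 1 pipeline grows the point cloud along a camera path and uses inpainting to improve coverage. One plausible reading is that the current cloud is reprojected into the next view and the uncovered pixels are the ones handed to the inpainting model. The sketch below illustrates only that reprojection step under simplifying assumptions that do not come from the paper: the helper `render_hole_mask` is hypothetical, the camera moves by pure translation `t` with no rotation, and each point covers a single pixel.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def render_hole_mask(
    points: Sequence[Point],
    width: int,
    height: int,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    t: Tuple[float, float, float] = (0.0, 0.0, 0.0),
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Project points into a camera translated by t (hypothetical helper).

    Returns a z-buffer and a mask that is True where no point landed,
    i.e. where new content would have to be synthesized.
    """
    inf = float("inf")
    zbuf = [[inf] * width for _ in range(height)]
    for x, y, z in points:
        # Express the point in the moved camera's frame (translation only).
        xc, yc, zc = x - t[0], y - t[1], z - t[2]
        if zc <= 0.0:
            continue  # behind the camera
        u = round(fx * xc / zc + cx)
        v = round(fy * yc / zc + cy)
        # Keep the nearest point per pixel so closer surfaces occlude farther ones.
        if 0 <= u < width and 0 <= v < height and zc < zbuf[v][u]:
            zbuf[v][u] = zc
    holes = [[d == inf for d in row] for row in zbuf]
    return zbuf, holes
```

In a fuller system the hole mask would be passed, together with the partially rendered image, to an inpainting model, and the newly filled pixels would be lifted back into 3D using estimated depth. Those stages depend on external models and are not sketched here.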