GEN3D: Generating Domain-Free 3D Scenes from a Single Image
Yuxin Zhang, Ziyu Lu, Hongbo Duan, Keyu Fan, Pengting Luo, Peiyu Zhuang, Mengyu Yang, Houde Liu
TL;DR
Gen3d tackles the challenge of producing high-quality 3D scenes from a single image by integrating depth-guided foreground/background segmentation with diffusion-based inpainting, then building a growing point cloud along a predefined camera path. This geometric prior is refined via a 3D Gaussian Splatting representation to enable efficient, view-consistent novel-view rendering. The method supports inputs from text, RGB, or RGBD and leverages Stable Diffusion, monocular depth estimation, and explicit 3D representations to achieve domain-free 3D scene generation. Experimental results on WorldScore and qualitative benchmarks show improved 3D consistency and photometric quality, indicating strong practical potential for embodied AI and world-model learning.
Abstract
Despite recent advances in neural 3D reconstruction, existing methods depend on dense multi-view captures, which restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high-quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generating high-quality, wide-scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized by optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and synthesizing high-fidelity, consistent novel views.
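
The lifting step mentioned in the abstract can be illustrated with standard pinhole back-projection. The sketch below is not the authors' implementation: the function name lift_rgbd_to_points and the intrinsics fx, fy, cx, cy are assumptions introduced for illustration, and the abstract does not state which camera model or depth estimator Gen3d uses.

```python
import numpy as np

def lift_rgbd_to_points(rgb, depth, fx, fy, cx, cy):
    """Back-project an RGBD image into a colored point cloud in the camera frame.

    Hypothetical helper: assumes a pinhole camera with focal lengths (fx, fy)
    and principal point (cx, cy), and depth measured along the optical axis.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    # Drop pixels with missing or non-positive depth.
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid], colors[valid]
```

Calling lift_rgbd_to_points(rgb, depth, fx, fy, cx, cy) on an H×W image returns one 3D point and one color per valid pixel. In a pipeline like the one described, such a cloud would serve as the initial geometry that later views extend.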
