SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie
TL;DR
This paper tackles single-image 3D scene generation by proposing SceneGen, a feedforward model that jointly synthesizes multiple 3D assets, with their geometry, texture, and relative spatial positions, from a scene image and object masks. It introduces a dual-encoder feature extraction scheme (DINOv2 for visual features and VGGT for geometric features) and a feature aggregation module with local and global attention, which together capture coherent inter-asset relationships in a single pass. The model is trained on 3D-FUTURE data with conditional flow matching, position-aware losses, and a collision penalty; only the global-attention components and the position head are learned. At inference, the method generalizes to multi-view inputs, which improves geometry without additional training. Empirical results show that SceneGen outperforms existing baselines in both geometric fidelity and texture quality while generating multi-asset scenes in a practical amount of time, and a qualitative exploration on ScanNet++ confirms its robustness to multi-view inputs.
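For reference, a conditional flow matching objective is often written in the rectified-flow form below. This is a generic sketch and not necessarily the exact loss used by SceneGen. All symbols are assumptions introduced here: $x_0$ is a clean asset latent, $\epsilon$ is Gaussian noise, $t$ is the time step, $c$ stands for the conditioning features derived from the scene image and object masks, and $v_\theta$ is the learned velocity field.

```latex
% Linear interpolation between data and noise (assumed rectified-flow path)
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]

% Regress the velocity d x_t / d t = \epsilon - x_0 under conditioning c
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\, x_0,\, \epsilon}
  \left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|_2^2
```

Under this form, the position-aware losses and the collision penalty mentioned above would enter as additional terms; their exact definitions are not given in this summary.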
Abstract
3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets from a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for extra optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architecture yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
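One way to picture the difference between local and global scene information in the feature aggregation module is through attention masks. The sketch below is illustrative only and makes a loud assumption: "local" is taken to mean that tokens attend only to tokens of the same asset, and "global" that every token attends to all tokens in the scene. The function name `build_attention_masks` and its parameter are hypothetical and do not come from the SceneGen code.

```python
from typing import List, Tuple

Mask = List[List[bool]]


def build_attention_masks(tokens_per_asset: List[int]) -> Tuple[Mask, Mask]:
    """Build boolean attention masks for a scene (hypothetical helper).

    mask[i][j] is True when token i is allowed to attend to token j.
    Assumption: local attention is restricted to tokens of the same asset,
    while global attention spans all assets in the scene.
    """
    # Record which asset each token belongs to, e.g. [2, 3] -> [0, 0, 1, 1, 1].
    owner = [asset for asset, n in enumerate(tokens_per_asset) for _ in range(n)]
    total = len(owner)

    # Local mask: block-diagonal, one block per asset.
    local_mask = [[owner[i] == owner[j] for j in range(total)] for i in range(total)]

    # Global mask: every token can attend to every other token.
    global_mask = [[True] * total for _ in range(total)]
    return local_mask, global_mask


# Example: a scene with two assets that have 2 and 3 tokens respectively.
local_mask, global_mask = build_attention_masks([2, 3])
```

With masks like these, a local block keeps each asset's features separate, while a global block lets assets exchange information, which is the kind of interaction needed to predict relative spatial positions.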
