SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie

TL;DR

This paper tackles single-image 3D scene generation by proposing SceneGen, a feedforward model that jointly synthesizes multiple 3D assets with geometry, texture, and relative spatial positions from a scene image and object masks. It introduces a dual-encoder feature extraction scheme (DINOv2 for visual features and VGGT for geometric features) and a feature aggregation module with local and global attention that models inter-asset relationships in a single pass. Training uses conditional flow matching, position-aware losses, and a collision penalty on 3D-FUTURE data, and only the global-attention components and the position head are trained. At inference the method generalizes to multi-view inputs, improving geometry without additional training. Empirical results show that SceneGen outperforms existing baselines in both geometric fidelity and texture quality while generating multi-asset scenes in a practical amount of time, and qualitative results on ScanNet++ confirm that it handles multi-view inputs robustly.
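To make the training objective named above more concrete, here is a minimal sketch of a conditional flow matching loss on a straight-line path between data and noise. It is an illustration only: the linear path, the function name `cfm_loss`, the `velocity_fn` callback, and the toy list-based vectors are all assumptions, and the paper's exact formulation, weighting, and latent shapes may differ.

```python
def cfm_loss(velocity_fn, x0, noise, t, cond=None):
    """Single-sample conditional flow matching loss (hypothetical helper).

    x0    : clean latent (list of floats)
    noise : Gaussian sample with the same length as x0
    t     : time in [0, 1]; t=0 is data, t=1 is noise (assumed convention)
    cond  : conditioning passed through to the model, e.g. image features
    """
    # Point on the straight path between the data and the noise sample.
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    # The derivative of that path with respect to t is constant.
    target = [n - a for a, n in zip(x0, noise)]
    # The model predicts a velocity from the noisy point, time, and condition.
    pred = velocity_fn(x_t, t, cond)
    # Mean squared error between predicted and target velocity.
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x0)


if __name__ == "__main__":
    x0, noise = [1.0, 2.0], [0.0, 4.0]
    zero_model = lambda x_t, t, cond: [0.0 for _ in x_t]
    print(cfm_loss(zero_model, x0, noise, t=0.5))
```

In practice `t` and `noise` would be sampled afresh for every training example, and `velocity_fn` would be the network whose trainable parts are being optimized.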

Abstract

3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for extra optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architecture yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
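The distinction between local and global scene information in contribution (ii) can be pictured with attention masks. The sketch below is only an illustration under a simplifying assumption: "local" is modeled as tokens attending within their own asset, and "global" as all tokens attending to each other. The function name `attention_masks` and this block-diagonal reading are hypothetical and are not taken from the paper, whose actual blocks also involve image and scene features.

```python
def attention_masks(tokens_per_asset):
    """Return (local, global) boolean masks for a flat token sequence.

    tokens_per_asset : list of token counts, one entry per asset
    mask[q][k] is True when query token q may attend to key token k.
    """
    # Record which asset each token in the flattened sequence belongs to.
    owner = [i for i, n in enumerate(tokens_per_asset) for _ in range(n)]
    total = len(owner)
    # Local: block-diagonal, a token only sees tokens of the same asset.
    local = [[owner[q] == owner[k] for k in range(total)] for q in range(total)]
    # Global: every token sees every other token, across all assets.
    global_mask = [[True] * total for _ in range(total)]
    return local, global_mask


if __name__ == "__main__":
    local, global_mask = attention_masks([2, 1])
    for row in local:
        print(row)
```

Under this reading, the global mask is what lets information about one asset influence the features, and hence the predicted position, of another.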

Paper Structure

This paper contains 22 sections, 15 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview. Our proposed SceneGen framework takes a single scene image and its corresponding object masks as inputs, and efficiently generates multiple 3D assets with coherent geometry, texture, and spatial arrangement in a single feedforward pass.
  • Figure 2: 3D Scene Generation. (a) Existing methods typically require segmenting target objects from the scene image; (b) Two-stage methods like CAST [yao2025cast] sequentially retrieve or generate individual assets, then assemble them via post-processing; (c) Methods such as MIDI [huang2025midi] directly generate multiple assets from a single image, but suffer from blurry details and unreasonable spatial layouts; (d) In contrast, our SceneGen jointly synthesizes the geometry, texture, and spatial positions of multiple assets in a single feedforward pass, producing plausible 3D scenes.
  • Figure 3: Architecture Overview. SceneGen takes a single scene image with multiple objects and corresponding segmentation masks as input. A pre-trained local attention block first refines the texture of each asset. Then, our introduced global attention block integrates asset-level and scene-level features extracted by dedicated visual and geometric encoders. Finally, two off-the-shelf structure decoders and our position head decode these latent features into multiple 3D assets with geometry, texture, and relative spatial positions.
  • Figure 4: Qualitative Comparisons on the 3D-FUTURE Test Set and ScanNet++. Our proposed SceneGen generates physically plausible 3D scenes with complete structures, detailed textures, and precise spatial relationships, outperforming prior methods in geometric accuracy and visual quality on both the synthetic and the real-world dataset.
  • Figure 5: Qualitative Results with Multi-view Inputs. SceneGen can directly handle multi-view inputs from ScanNet++ and even achieves better generation quality with them, especially more accurate structure.
  • ...and 3 more figures