SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Alexey Bokhovkin, Quan Meng, Shubham Tulsiani, Angela Dai

TL;DR

SceneFactor addresses scalable, controllable 3D scene generation by decoupling semantic layout from detailed geometry into a factored diffusion framework that operates on proxy spaces $S$ (semantic) and $G$ (geometry). It learns dual latent spaces via VQ-VAE and applies two diffusion models, $\Psi_S$ and $\Psi_G$, to generate coarse semantic maps conditioned on text and refine geometry conditioned on those semantics, respectively. The method supports outpainting for arbitrarily large scenes and enables intuitive localized editing by manipulating semantic boxes, with edits propagating coherently to geometry through aligned latent grids. Empirical results on 3D-FRONT and 3D-FUTURE show improved geometry fidelity and text adherence over state-of-the-art baselines, validated by metrics such as $\text{MMD}$, $\text{COV}$, $\text{1-NNA}$, CLIP scores, and perceptual studies, highlighting its potential for artist-driven automated 3D content creation.
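
One way to read the factorization described above is as a two-term decomposition of the joint distribution over semantics and geometry. This is an illustrative sketch, not a formula quoted from the paper: the symbol $c$ for the text caption is introduced here, the expression ignores the VQ-VAE encoding step (the diffusion models operate on latents of $S$ and $G$), and it assumes geometry depends on the text only through the semantic map.

$$
p(G, S \mid c) \;=\; \underbrace{p_{\Psi_S}(S \mid c)}_{\text{semantic layout from text}} \;\cdot\; \underbrace{p_{\Psi_G}(G \mid S)}_{\text{geometry from layout}}
$$

Under this reading, sampling a scene means first drawing $S$ from $\Psi_S$ given the caption, then drawing $G$ from $\Psi_G$ given $S$. Editing $S$ and re-running only the second term is what makes localized geometric edits possible.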

Abstract

We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes. This enables controllable editing of generated scenes by adding, removing, or resizing the semantic 3D proxy boxes, which in turn guide high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.
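
The box-level editing described in the abstract can be illustrated with a toy semantic voxel grid. The following is a minimal sketch, not the authors' implementation: the grid representation, the label ids (`EMPTY`, `CHAIR`), and all function names are hypothetical, and the idea that the set of changed voxels marks the region whose geometry would need re-synthesis is an assumption made for illustration.

```python
import copy

EMPTY = 0  # hypothetical label id for free space
CHAIR = 3  # hypothetical label id for one object category


def make_map(nx, ny, nz):
    """Create an nx * ny * nz semantic grid filled with EMPTY."""
    return [[[EMPTY] * nz for _ in range(ny)] for _ in range(nx)]


def fill_box(sem, lo, hi, label):
    """Write `label` into the axis-aligned box [lo, hi) in place."""
    for x in range(lo[0], hi[0]):
        for y in range(lo[1], hi[1]):
            for z in range(lo[2], hi[2]):
                sem[x][y][z] = label


def edit_mask(before, after):
    """Return the set of voxel coordinates whose label differs."""
    return {
        (x, y, z)
        for x, plane in enumerate(after)
        for y, row in enumerate(plane)
        for z, label in enumerate(row)
        if before[x][y][z] != label
    }


def resize_box(sem, old, new, label):
    """Replace box `old` with box `new` (each a (lo, hi) pair).

    Returns the edited copy and the set of voxels that changed.
    """
    edited = copy.deepcopy(sem)
    fill_box(edited, old[0], old[1], EMPTY)  # remove the old box
    fill_box(edited, new[0], new[1], label)  # add the resized box
    return edited, edit_mask(sem, edited)


# Usage: grow a 2x2x2 chair box by one voxel along x.
sem = make_map(8, 8, 4)
fill_box(sem, (1, 1, 0), (3, 3, 2), CHAIR)
edited, mask = resize_box(
    sem, ((1, 1, 0), (3, 3, 2)), ((1, 1, 0), (4, 3, 2)), CHAIR
)
```

Adding an object corresponds to a single `fill_box` call with a category label, and removing one corresponds to filling its box with `EMPTY`. Only voxels in `mask` differ from the original map, which mirrors the claim that the rest of the scene stays untouched by a local edit.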

Paper Structure

This paper contains 21 sections, 14 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: SceneFactor factors the complex task of text-guided 3D scene generation into forming a coarse semantic structure, followed by refined geometric synthesis. Rather than require a learned model to decide the location, type, size, and local geometry of scene elements directly, our generation of a coarse semantic box layout enables training a simpler task of layout-guided geometric synthesis. To achieve this factorized generation, we train semantic and geometric latent diffusion models. Crucially, the proxy semantic map generation enables user-friendly localized editing of generated scenes by editing in the semantic map with simple box operations (by clicking two box corners), without requiring re-synthesis of the full scene. Note that input text is colored by semantic categories for visualization purposes only.
  • Figure 2: Method overview. We formulate text-guided 3D scene generation as a factored diffusion process, first generating a coarse semantic box layout representing the text input (left), followed by synthesis of scene geometry corresponding to the generated semantics (right). This factorization makes complex 3D scene generation more tractable and enables generation of locally editable 3D scenes, which can be manipulated through box manipulations in the semantic maps. Left: Our high-level semantic generation produces a coarse, box-level representation of a scene through latent diffusion on a pretrained semantic manifold, conditioned on text captions. This enables accurate alignment between text input and scene layout, without requiring solving a highly ambiguous generation task for geometric detail. Right: Conditioned on the coarse semantic box map, we use another latent diffusion model to generate 3D scene geometry, enabling spatial semantic grounding of generated scene objects and structures. Object categories in the text input are colored for visualization only.
  • Figure 3: Chunk-based 3D scene generation. Left: Chunks for a scene are generated in sliding-window fashion (1-2-3), with overlap between generated chunks to ensure scene consistency along boundaries. Right: Synthesis of a chunk (chunk 3) is based on regions of previously generated chunks (1,2). The purple incomplete region is then synthesized by inpainting based on the previously generated blue, green, and yellow regions.
  • Figure 4: Scene editing. SceneFactor enables seamless localized editing through easy manipulation of the 3D semantic box map. We demonstrate the addition of objects (adding boxes), moving objects (moving an existing semantic box), changing object size (scaling an existing semantic box), replacing objects (replacing an existing object box with a new one of a different category), and removing objects (removing an existing semantic box). Note that the rest of the 3D scene remains consistent outside of the editing region.
  • Figure 5: Qualitative comparisons to state-of-the-art diffusion-based 3D scene generative approaches BlockFusion (Wu et al., 2024) and SDFusion (Cheng et al., 2023). Our approach produces improved scene geometry and more cohesive global scene structure with consistent walls compared to baselines. Note that results for BlockFusion are generated unconditionally.
  • ...and 7 more figures
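
The sliding-window scheme in the Figure 3 caption can be sketched as a planning step that decides, for each chunk, which cells are already known from earlier chunks and which must be inpainted. This is a simplified 2D illustration under stated assumptions: the chunk size, overlap, raster ordering, and the function names `chunk_origins` and `plan_generation` are hypothetical, and the actual diffusion-based inpainting is not modeled.

```python
from itertools import product


def chunk_origins(scene_size, chunk, overlap):
    """Origins of overlapping square chunks covering a 2D scene extent."""
    stride = chunk - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the chunk size")

    def axis(n):
        # Step by `stride`, then clamp a final window to the scene border.
        starts = list(range(0, max(n - chunk, 0) + 1, stride))
        if starts[-1] + chunk < n:
            starts.append(n - chunk)
        return starts

    return list(product(axis(scene_size[0]), axis(scene_size[1])))


def plan_generation(scene_size, chunk, overlap):
    """Return a list of (origin, known_mask) pairs in generation order.

    known_mask[i][j] is True where an earlier chunk already produced
    content; only the False cells would need to be inpainted.
    """
    filled = [[False] * scene_size[1] for _ in range(scene_size[0])]
    plan = []
    for x0, y0 in chunk_origins(scene_size, chunk, overlap):
        xs = range(x0, min(x0 + chunk, scene_size[0]))
        ys = range(y0, min(y0 + chunk, scene_size[1]))
        known = [[filled[x][y] for y in ys] for x in xs]
        plan.append(((x0, y0), known))
        for x in xs:  # mark this chunk's footprint as generated
            for y in ys:
                filled[x][y] = True
    return plan


# Usage: a 6x6 scene covered by 4x4 chunks with an overlap of 2 cells.
plan = plan_generation((6, 6), chunk=4, overlap=2)
```

In this toy setup the first chunk has no known cells and is generated from scratch, while the last chunk overlaps three earlier ones and only its far corner remains to be filled, analogous to the purple region synthesized from the blue, green, and yellow regions in the figure.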