SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Alexey Bokhovkin; Quan Meng; Shubham Tulsiani; Angela Dai

TL;DR

SceneFactor addresses scalable, controllable 3D scene generation by decoupling semantic layout from detailed geometry in a factored diffusion framework that operates on two proxy spaces, $S$ (semantic) and $G$ (geometry). It learns dual latent spaces via VQ-VAE and applies two diffusion models, $\Psi_S$ and $\Psi_G$: $\Psi_S$ generates coarse semantic maps conditioned on text, and $\Psi_G$ refines geometry conditioned on those semantics. The method supports outpainting for arbitrarily large scenes and enables intuitive localized editing by manipulating semantic boxes, with edits propagating coherently to geometry through aligned latent grids. Empirical results on 3D-FRONT and 3D-FUTURE show improved geometry fidelity and text adherence over state-of-the-art baselines, validated by MMD, COV, 1-NNA, CLIP scores, and perceptual studies, highlighting its potential for artist-driven automated 3D content creation.
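One way to read the factorization described above is as a two-stage conditional model. The notation here is an assumption for illustration and is not taken from the paper: $c$ stands for the text prompt, $z_S$ for a latent semantic map sampled by $\Psi_S$, and $z_G$ for a latent geometry grid sampled by $\Psi_G$.

$$
p(z_G, z_S \mid c) \;=\; \underbrace{p_{\Psi_S}(z_S \mid c)}_{\text{text} \,\to\, \text{semantics}} \;\cdot\; \underbrace{p_{\Psi_G}(z_G \mid z_S)}_{\text{semantics} \,\to\, \text{geometry}}
$$

Under this reading, the geometry stage depends on the text only through the semantic map, which is why an edit to the semantic map is enough to steer the geometry.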

Abstract

We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds to generate 3D scenes of arbitrary size. While text input enables easy, controllable generation, text guidance remains too imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes. Adding, removing, or resizing these proxy boxes gives controllable editing of generated scenes, and the edited boxes in turn guide high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our factored diffusion approach enables high-fidelity 3D scene synthesis with effective controllable editing.
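The box-based editing described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the proxy semantic space is a dense voxel grid of integer class ids, and the names `make_grid`, `fill_box`, `changed_mask`, `BED`, and `TABLE` are hypothetical. It only shows how adding or removing a box yields a mask of edited cells, the region a geometry model would then need to regenerate.

```python
from typing import List, Tuple

Grid = List[List[List[int]]]               # grid[x][y][z] is a class id, 0 = empty
Box = Tuple[int, int, int, int, int, int]  # (x0, y0, z0, x1, y1, z1), end-exclusive

BED, TABLE = 1, 2  # hypothetical class ids, for illustration only


def make_grid(nx: int, ny: int, nz: int) -> Grid:
    """Create an empty semantic grid of size nx * ny * nz."""
    return [[[0] * nz for _ in range(ny)] for _ in range(nx)]


def fill_box(grid: Grid, box: Box, label: int) -> Grid:
    """Return a copy of `grid` with `box` set to `label` (label 0 removes a box)."""
    x0, y0, z0, x1, y1, z1 = box
    out = [[col[:] for col in plane] for plane in grid]
    # Clamp the box to the grid bounds so out-of-range edits are ignored.
    for x in range(max(x0, 0), min(x1, len(out))):
        for y in range(max(y0, 0), min(y1, len(out[0]))):
            for z in range(max(z0, 0), min(z1, len(out[0][0]))):
                out[x][y][z] = label
    return out


def changed_mask(before: Grid, after: Grid) -> Grid:
    """Mark cells whose class id differs between two grids (1 = edited)."""
    return [
        [[int(b != a) for b, a in zip(col_b, col_a)]
         for col_b, col_a in zip(plane_b, plane_a)]
        for plane_b, plane_a in zip(before, after)
    ]


# Build a scene with one bed, then remove the bed and add a table.
scene = fill_box(make_grid(8, 8, 4), (0, 0, 0, 4, 3, 2), BED)
edited = fill_box(scene, (0, 0, 0, 4, 3, 2), 0)         # remove the bed
edited = fill_box(edited, (5, 5, 0, 7, 7, 2), TABLE)    # add a table
mask = changed_mask(scene, edited)
num_edited = sum(v for plane in mask for col in plane for v in col)
```

In this sketch, resizing a box is a removal followed by a fill with the new extents. Cells outside the mask are untouched, which reflects the localized nature of the edits described in the abstract.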
