Disentangled 3D Scene Generation with Layout Learning

Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A. Efros, Aleksander Holynski

TL;DR

Disentangled 3D Scene Generation with Layout Learning introduces an unsupervised approach that decomposes generated scenes into objects by jointly optimizing $K$ NeRFs and a set of learnable layouts under a pretrained text-to-image diffusion prior, using score distillation sampling (SDS). Objects are defined as parts of a scene that can be rearranged by affine transforms while the scene remains valid. The composite density is the sum of the per-object densities, $\tau' = \sum_k \tau_k$, and the composite color is their density-weighted average, $\boldsymbol{\rho}' = \sum_k (\tau_k/\tau') \boldsymbol{\rho}_k$, so the $K$ objects render as one coherent multi-object scene. The method yields high-quality, editable 3D scenes and enables object-level manipulation and asset integration without supervision; a quantitative CLIP-based evaluation shows disentanglement performance competitive with per-object supervision. Limitations include the inherent ill-posedness of 3D disentanglement, failure modes such as the Janus problem, and biases inherited from the diffusion model, which underscore ongoing challenges and ethical considerations in unsupervised text-to-3D generation.
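The compositing rule in the TL;DR can be illustrated with a minimal sketch at a single 3D sample point. This is not the authors' code: the function name `composite`, the `eps` guard, and the choice of black for empty space are assumptions made for illustration only.

```python
def composite(densities, colors, eps=1e-10):
    """Combine K per-object samples at one 3D point (illustrative sketch).

    densities: list of K non-negative floats (tau_k)
    colors:    list of K (r, g, b) tuples (rho_k)
    Returns (tau', rho') with tau' = sum_k tau_k and
    rho' = sum_k (tau_k / tau') * rho_k.
    """
    total = sum(densities)
    if total <= eps:
        # Assumption: in empty space no object contributes, so return black.
        return 0.0, (0.0, 0.0, 0.0)
    # Each object's color is weighted by its share of the total density.
    weights = [tau / total for tau in densities]
    color = tuple(
        sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3)
    )
    return total, color


# A dense red object overlapping a sparser blue one at the same point.
tau, rho = composite([3.0, 1.0], [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)])
```

Because the weights $\tau_k/\tau'$ sum to one, the composite color is a convex combination of the object colors, dominated by whichever object is densest at that point.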

Abstract

We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch, each representing its own object, along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at https://dave.ml/layoutlearning/
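The layout mechanism described in the abstract (several objects, several layouts, one layout used per iteration) can be sketched roughly as follows. Everything here is a hypothetical illustration rather than the paper's implementation: the names `make_affine`, `apply_affine`, `sample_layout`, and `query_scene`, the yaw-plus-uniform-scale parametrization, the direction in which the transform is applied, uniform sampling over layouts, and the toy `unit_ball` object that stands in for a NeRF.

```python
import math
import random


def make_affine(yaw, translation, scale):
    """Hypothetical affine transform: rotation about z, uniform scale, translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    rot = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return {"rot": rot, "t": translation, "s": scale}


def apply_affine(T, p):
    """Map point p to scale * (R p) + t. The mapping direction is an assumption."""
    rotated = [sum(T["rot"][i][j] * p[j] for j in range(3)) for i in range(3)]
    return tuple(T["s"] * rotated[i] + T["t"][i] for i in range(3))


def sample_layout(layouts, rng=random):
    """Pick one of the N layouts at random (uniform sampling is an assumption)."""
    return rng.choice(layouts)


def query_scene(objects, layout, p):
    """Query every object at its own transformed copy of p.

    Returns a list of K (density, color) pairs, one per object.
    """
    return [f(apply_affine(T, p)) for f, T in zip(objects, layout)]


def unit_ball(q):
    """Toy stand-in for a NeRF: density 1 inside the unit ball, white color."""
    inside = sum(x * x for x in q) <= 1.0
    return (1.0 if inside else 0.0), (1.0, 1.0, 1.0)


K, N = 2, 3
objects = [unit_ball] * K
# Layout n shifts every object by n units along x, so layouts differ in placement.
layouts = [
    [make_affine(0.0, (float(n), 0.0, 0.0), 1.0) for _ in range(K)]
    for n in range(N)
]
layout = sample_layout(layouts, random.Random(0))
samples = query_scene(objects, layout, (0.0, 0.0, 0.0))
```

In a full pipeline the per-object samples would be composited and rendered, and the rendered image would drive the diffusion-based loss; those steps are omitted here.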

Paper Structure (16 sections, 4 equations, 8 figures)

Figures (8)

  • Figure 1: Layout learning generates disentangled 3D scenes given a text prompt and a pretrained text-to-image diffusion model. We learn an entire 3D scene (left, shown from two views along with surface normals and a textureless render) that is composed of multiple NeRFs (right) representing different objects and arranged according to a learned layout.
  • Figure 2: Method. Layout learning works by optimizing $K$ NeRFs $f_k$ and learning $N$ different layouts $\mathbf{L}_n$ for them, each consisting of per-NeRF affine transforms $\mathbf{T}_k$. Every iteration, a random layout is sampled and used to transform all NeRFs into a shared coordinate space. The resultant volume is rendered and optimized with score distillation sampling (SDS) as well as per-NeRF regularizations to prevent degenerate decompositions and geometries (Barron et al., CVPR 2022). This simple structure causes object disentanglement to emerge in generated 3D scenes.
  • Figure 3: Evaluating disentanglement and quality. We optimize a model with $K=3$ NeRFs on a list of 30 prompts, each containing three objects. We then automatically pair each NeRF with a description of one of the objects in the prompt and report average NeRF-object CLIP score (see text for details). We also generate each of the $30\times3=90$ objects from the prompt list individually and compute its score with both the corresponding prompt and a random other one, providing upper and lower bounds for performance on this task. Training $K$ NeRFs provides some decomposition, but most objects are scattered across 2 or 3 models. Learning one layout alleviates some of these issues, but only with multiple layouts do we see strong disentanglement. We show two representative examples of emergent objects to visualize these differences.
  • Figure 4: Conditional optimization. We can take advantage of our structured representation to learn a scene given a 3D asset in addition to a text prompt, such as a specific cat or motorcycle (a). By freezing the NeRF weights but not the layout weights, the model learns to arrange the provided asset in the context of the other objects it discovers (b). We show the entire composite scenes the model creates in (c) from two views, along with surface normals and a textureless render.
  • Figure 5: Layout diversity. Our method discovers different plausible arrangements for objects. Here, we optimize each example over $N=4$ layouts and show differences in composited scenes, e.g. flamingos wading inside vs. beside the pond, and cats in different poses around the snooker table.
  • ...and 3 more figures
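
The Figure 3 caption mentions automatically pairing each NeRF with one object description and averaging the resulting NeRF-object CLIP scores, but it does not say how the pairing is chosen. The sketch below assumes one plausible rule: pick the one-to-one assignment that maximizes the total score, found by brute force over permutations (cheap for $K=3$). The function name `best_pairing` and the score matrix are hypothetical; the numbers are made-up placeholders, not results from the paper, and no CLIP model is called.

```python
from itertools import permutations


def best_pairing(scores):
    """Pair NeRFs with descriptions one-to-one, maximizing the total score.

    scores[i][j]: similarity between renders of NeRF i and description j.
    Returns (assignment, mean_score), where assignment[i] is the index of
    the description paired with NeRF i.
    """
    k = len(scores)
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(k)),
    )
    mean = sum(scores[i][best[i]] for i in range(k)) / k
    return best, mean


# Placeholder similarity matrix for K = 3 NeRFs and 3 object descriptions.
scores = [
    [0.21, 0.30, 0.19],
    [0.31, 0.22, 0.20],
    [0.18, 0.21, 0.29],
]
assignment, mean_score = best_pairing(scores)
```

Brute force enumerates $K!$ assignments, which is fine for three objects; for larger $K$ an assignment solver such as the Hungarian algorithm would be the usual substitute.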