Learning Object Placement Programs for Indoor Scene Synthesis with Iterative Self Training

Adrian Chang, Kai Wang, Yuanbo Li, Manolis Savva, Angel X. Chang, Daniel Ritchie

TL;DR

This work addresses the tendency of autoregressive indoor scene synthesis to produce incomplete per-object placement distributions. It introduces a relational, human-interpretable Domain Specific Language (DSL) of functional constraints to express object placement rules, and a transformer-based model that writes DSL programs which, when executed, yield feasible placement masks. A bootstrapped, unsupervised self-training loop (PLAD-inspired) plus a scene-classifier filter improves program quality and per-object distribution coverage, while a dedicated evaluation protocol compares distributions against human annotations. Results show improved alignment with human placement rules, comparable final scene quality, and robustness to data sparsity, highlighting the value of neurosymbolic approaches for controllable, diverse indoor scene generation.

Abstract

Data-driven, autoregressive indoor scene synthesis systems generate indoor scenes automatically by suggesting and then placing objects one at a time. Empirical observations show that current systems tend to produce incomplete next-object location distributions. We introduce a system that addresses this problem. We design a Domain Specific Language (DSL) that specifies functional constraints. Programs in our language take as input a partial scene and an object to place; upon execution, they predict possible object placements. We design a generative model that writes these programs automatically. Available 3D scene datasets do not contain programs to train on, so we build upon previous work in unsupervised program induction to introduce a new program bootstrapping algorithm. To quantify our empirical observations, we introduce a new evaluation procedure that captures how well a system models per-object location distributions. We ask human annotators to label all the possible places an object can go in a scene and show that our system produces per-object location distributions more consistent with human annotators. Our system also generates indoor scenes of comparable quality to previous systems, and while previous systems degrade in performance when training data is sparse, our system does not degrade to the same degree.

Paper Structure

This paper contains 33 sections, 3 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Inference Pipeline: Our system is autoregressive, placing objects one at a time. Given a (1) partial scene and query object we (2) sample a DSL program from a generative model. (3) Upon execution, this produces a binary mask of possible centroid locations for the query object in the scene. (4) Sampling this mask produces valid object locations of the query object.
  • Figure 2: We develop a program bootstrapping algorithm which discovers programs automatically from scene data. (1) We start by extracting programs with geometric heuristics and then (2) training our model on these initial programs. (3) We propose new programs by deleting constraints from both the inferred and original programs. (4) These candidate programs are then filtered with a real/fake scene classifier to remove "bad" programs. (5) Domain-specific operations combine "good" candidate programs and insert them back into the training set.
  • Figure 3: Program Example: Given a partial scene and object to add, our DSL program outputs a binary mask representing possible placements of that object. Programs take on the structure of Constructive Solid Geometry (CSG) trees where each leaf node is a functional constraint that describes object function. Upon execution, these constraints produce binary masks which are combined according to the structure of the tree.
  • Figure 4: Our generative model takes as input a partial scene and object and predicts a program written in our DSL. We formulate this modeling task as a sequence-to-sequence (seq2seq) problem. (1) We first vectorize and then embed both the input objects and program. The structure of the program tree and the constraint attributes are embedded as separate sequences. (2) Our first transformer encoder-decoder pair predicts the structure of the program from the input objects. (3) Our second transformer encoder-decoder pair predicts the constraint attributes from the object and structure embeddings.
  • Figure 5: Given two programs that produce different placement modes, we combine them into a new program with domain specific operations. We create a new tree with an "or" node as its root. The two programs are set as its children.
  • ...and 14 more figures
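
The captions for Figures 1, 3, and 5 describe three mechanical steps: executing a CSG-style program tree into a binary mask, merging two programs under an "or" root, and sampling a centroid from the resulting mask. The following is a minimal Python sketch of those steps, not the authors' implementation. The names `Leaf`, `Node`, `execute`, `combine_programs`, and `sample_location` are hypothetical, as are the toy constraints. The sketch assumes an "and" node exists alongside "or", and it simplifies constraints to functions of the grid size only, whereas the paper's constraints depend on the partial scene and the query object.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# Binary mask over a grid of candidate centroid cells: mask[row][col].
Mask = List[List[bool]]


@dataclass
class Leaf:
    """Leaf node: one constraint that yields a binary mask (hypothetical form)."""
    constraint: Callable[[int, int], Mask]


@dataclass
class Node:
    """Internal node that combines two sub-programs with "and" or "or"."""
    op: str
    left: "Program"
    right: "Program"


Program = Union[Leaf, Node]


def execute(program: Program, rows: int, cols: int) -> Mask:
    """Evaluate the tree bottom-up, combining child masks cell by cell."""
    if isinstance(program, Leaf):
        return program.constraint(rows, cols)
    a = execute(program.left, rows, cols)
    b = execute(program.right, rows, cols)
    if program.op == "and":
        return [[x and y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    if program.op == "or":
        return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    raise ValueError(f"unknown op: {program.op}")


def combine_programs(p: Program, q: Program) -> Program:
    """Merge two programs under a new "or" root, as in the Figure 5 caption."""
    return Node("or", p, q)


def sample_location(mask: Mask, rng: random.Random) -> Tuple[int, int]:
    """Pick one valid centroid cell uniformly at random from the mask."""
    cells = [(r, c) for r, row in enumerate(mask) for c, ok in enumerate(row) if ok]
    if not cells:
        raise ValueError("mask has no valid cells")
    return rng.choice(cells)


# Toy constraints (assumptions for illustration only).
def against_left_wall(rows: int, cols: int) -> Mask:
    return [[c == 0 for c in range(cols)] for _ in range(rows)]


def against_right_wall(rows: int, cols: int) -> Mask:
    return [[c == cols - 1 for c in range(cols)] for _ in range(rows)]


def in_top_half(rows: int, cols: int) -> Mask:
    return [[r < rows // 2 for _ in range(cols)] for r in range(rows)]


# Two programs with different placement modes, merged into one.
prog_a = Node("and", Leaf(against_left_wall), Leaf(in_top_half))
prog_b = Leaf(against_right_wall)
merged = combine_programs(prog_a, prog_b)

mask = execute(merged, 4, 4)
location = sample_location(mask, random.Random(0))
```

On a 4x4 grid, `prog_a` admits the two upper cells along the left wall and `prog_b` admits the four cells along the right wall, so the merged mask covers both placement modes. Uniform sampling over valid cells is one simple choice here; the source does not specify the sampling scheme.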