Learning Object Placement Programs for Indoor Scene Synthesis with Iterative Self Training
Adrian Chang, Kai Wang, Yuanbo Li, Manolis Savva, Angel X. Chang, Daniel Ritchie
TL;DR
This work addresses the tendency of autoregressive indoor scene synthesis to produce incomplete per-object placement distributions. It introduces a relational, human-interpretable Domain Specific Language (DSL) of functional constraints to express object placement rules, and a transformer-based model that writes DSL programs which, when executed, yield feasible placement masks. A bootstrapped, unsupervised self-training loop (PLAD-inspired) plus a scene-classifier filter improves program quality and per-object distribution coverage, while a dedicated evaluation protocol compares distributions against human annotations. Results show improved alignment with human placement rules, comparable final scene quality, and robustness to data sparsity, highlighting the value of neurosymbolic approaches for controllable, diverse indoor scene generation.
Abstract
Data-driven, autoregressive indoor scene synthesis systems generate indoor scenes automatically by suggesting and then placing objects one at a time. Empirical observations show that current systems tend to produce incomplete next-object location distributions. We introduce a system that addresses this problem. We design a Domain Specific Language (DSL) that specifies functional constraints. Programs in our language take as input a partial scene and an object to place; upon execution, they predict possible object placements. We design a generative model that writes these programs automatically. Available 3D scene datasets do not contain programs to train on, so we build upon previous work in unsupervised program induction to introduce a new program bootstrapping algorithm. To quantify our empirical observations, we introduce a new evaluation procedure that captures how well a system models per-object location distributions. We ask human annotators to label all the possible places an object can go in a scene and show that our system produces per-object location distributions more consistent with human annotations. Our system also generates indoor scenes of comparable quality to previous systems, and while previous systems degrade in performance when training data is sparse, our system does not degrade to the same degree.
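To make the idea of an executable placement program concrete, the following is a minimal sketch and not the DSL defined in this work. It assumes a room discretised into a small grid, and the operator names (`against_wall`, `near`, `and`, `or`), the tuple-based program encoding, and the helper `placement_mask` are all hypothetical choices made for illustration. A program takes a partial scene and, when executed, returns the set of free cells that satisfy its constraints.

```python
from itertools import product

GRID = 6  # assumption: the room floor is a 6x6 grid of cells


def all_cells():
    """Every (row, col) floor cell in the toy room."""
    return set(product(range(GRID), range(GRID)))


def against_wall(scene):
    """Cells on the room boundary (toy stand-in for a wall-attachment constraint)."""
    edge = (0, GRID - 1)
    return {(r, c) for r, c in all_cells() if r in edge or c in edge}


def near(scene, name, radius):
    """Cells within Chebyshev distance `radius` of any cell of the named object."""
    target = scene["objects"][name]
    return {
        (r, c)
        for r, c in all_cells()
        if any(max(abs(r - tr), abs(c - tc)) <= radius for tr, tc in target)
    }


def execute(program, scene):
    """Recursively evaluate a nested-tuple program into a set of cells."""
    op, *args = program
    if op == "and":
        return execute(args[0], scene) & execute(args[1], scene)
    if op == "or":
        return execute(args[0], scene) | execute(args[1], scene)
    if op == "against_wall":
        return against_wall(scene)
    if op == "near":
        return near(scene, *args)
    raise ValueError(f"unknown operator: {op}")


def placement_mask(program, scene):
    """Feasible cells: those satisfying the program and not already occupied."""
    occupied = set().union(*scene["objects"].values())
    return execute(program, scene) - occupied


# Partial scene: a bed occupying a 2x3 block in the top-left corner.
scene = {"objects": {"bed": {(r, c) for r in range(2) for c in range(3)}}}

# Hypothetical program for a nightstand: against a wall AND next to the bed.
program = ("and", ("against_wall",), ("near", "bed", 1))

mask = placement_mask(program, scene)
print(sorted(mask))  # [(0, 3), (2, 0)]
```

In this toy setting the program yields both wall cells adjacent to the bed rather than a single best guess, which is the kind of complete per-object location distribution the abstract refers to. A real system would also need to account for object extent and orientation, which this sketch ignores.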
