Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases

Rio Aguina-Kang, Maxim Gumin, Do Heon Han, Stewart Morris, Seung Jean Yoo, Aditya Ganeshan, R. Kenny Jones, Qiuhong Anna Wei, Kailiang Fu, Daniel Ritchie

TL;DR

The paper introduces an open-universe indoor scene generation framework that combines LLM-driven program synthesis in a domain-specific language (DSL), gradient-based layout optimization, and vision-language model (VLM)-guided retrieval of unannotated 3D meshes. By representing scenes as declarative programs and solving the constraint problems they induce, the approach avoids the need for large curated 3D scene datasets and supports arbitrary object categories. The system outperforms generative models trained on 3D data on closed-universe scene generation and outperforms LayoutGPT on open-universe generation; the paper also reports ablations and timing analyses. It further presents a retrieval and orientation pipeline that addresses category accuracy and object sizing when drawing from massive, unstructured mesh databases, enabling flexible, editable scene generation for design and simulation applications.

Abstract

We present a system for generating indoor scenes in response to text prompts. The prompts are not limited to a fixed vocabulary of scene descriptions, and the objects in generated scenes are not restricted to a fixed set of object categories -- we call this setting open-universe indoor scene generation. Unlike most prior work on indoor scene generation, our system does not require a large training dataset of existing 3D scenes. Instead, it leverages the world knowledge encoded in pre-trained large language models (LLMs) to synthesize programs in a domain-specific layout language that describe objects and spatial relations between them. Executing such a program produces a specification of a constraint satisfaction problem, which the system solves using a gradient-based optimization scheme to produce object positions and orientations. To produce object geometry, the system retrieves 3D meshes from a database. Unlike prior work which uses databases of category-annotated, mutually-aligned meshes, we develop a pipeline using vision-language models (VLMs) to retrieve meshes from massive databases of un-annotated, inconsistently-aligned meshes. Experimental evaluations show that our system outperforms generative models trained on 3D data for traditional, closed-universe scene generation tasks; it also outperforms a recent LLM-based layout generation method on open-universe scene generation.
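
The step from a declarative program to a concrete layout can be illustrated with a small sketch. This is not the paper's implementation: the relation tuple format, the `ROOM` size, the `loss` and `optimize` functions, and the use of finite-difference gradients are all assumptions made for this example. It treats each declared relation as a squared penalty and runs plain gradient descent on 2D object positions only (orientations are left out).

```python
import math

# Hypothetical scene "program": pairwise relations between named objects.
# Each relation is (kind, a, b, distance); the format is illustrative only.
relations = [
    ("distance", "table", "chair", 1.0),  # chair should sit 1 m from the table
    ("distance", "table", "lamp", 2.0),   # lamp should sit 2 m from the table
]
ROOM = 4.0  # side length of a square room in metres (assumed)


def loss(pos):
    """Sum of squared constraint violations for a layout {name: (x, y)}."""
    total = 0.0
    for _, a, b, d in relations:
        gap = math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])
        total += (gap - d) ** 2
    for x, y in pos.values():
        for c in (x, y):
            # Penalise coordinates that fall outside [0, ROOM].
            total += max(0.0, -c) ** 2 + max(0.0, c - ROOM) ** 2
    return total


def optimize(pos, steps=500, lr=0.05, eps=1e-5):
    """Plain gradient descent using central finite-difference gradients."""
    pos = {name: list(xy) for name, xy in pos.items()}
    for _ in range(steps):
        grads = {}
        for name in pos:
            g = []
            for i in range(2):
                saved = pos[name][i]
                pos[name][i] = saved + eps
                up = loss(pos)
                pos[name][i] = saved - eps
                down = loss(pos)
                pos[name][i] = saved
                g.append((up - down) / (2 * eps))
            grads[name] = g
        # Update all coordinates only after every gradient is computed.
        for name in pos:
            for i in range(2):
                pos[name][i] -= lr * grads[name][i]
    return {name: tuple(xy) for name, xy in pos.items()}


start = {"table": (2.0, 2.0), "chair": (2.1, 2.0), "lamp": (0.5, 3.5)}
layout = optimize(start)
print(layout, loss(layout))
```

A real system would likely use automatic differentiation rather than finite differences, and would include orientation variables and additional relation types; the sketch only shows how relations become a loss that gradient descent can minimise.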

Paper Structure

This paper contains 37 sections, 14 equations, 13 figures, and 5 tables.

Figures (13)

  • Figure 1: A schematic overview of our system. Given a high-level natural language description of a scene (plus optional constraints on the room size and object density), an LLM-based program synthesizer produces a scene description program which specifies the objects in the scene and their spatial relations. Our layout optimizer module then solves the constraint satisfaction problem implied by this program to produce a concrete layout of objects in the scene. For each scene object, the object retrieval module finds an appropriate 3D mesh from a large, unannotated mesh database; the object orientation module then identifies its front-facing direction so that it can be correctly inserted into the scene.
  • Figure 2: An example program in our declarative scene description language (left) and the object layout produced by running this program through our layout optimizer (right). This scene depicts a small, cozy Italian restaurant.
  • Figure 3: Our scene description program synthesizer proceeds in three steps, each of which uses a large language model. First, the LLM is asked to generate a natural language description of all the objects in the scene, along with how and why they are spatially related to one another. Then, a sequence of two LLMs translates this description into code which declares objects and relations, respectively.
  • Figure 4: Adding repel forces to layout optimization allows objects to be appropriately spaced without exhaustively specifying explicit relations.
  • Figure 5: Our pipeline for open-universe 3D object retrieval. As a preprocess, we compute embeddings for each object in our 3D mesh database using a vision-language model (VLM). Given a description and category of an object (both specified in the LLM-generated scene program), our system finds the $k$ nearest neighbors of the text description's VLM embedding in our database. These initial retrieval results are then re-ranked to prioritize objects with the correct category and further filtered to remove meshes which are of the wrong category or which consist of multiple objects.
  • ...and 8 more figures
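
The retrieve, re-rank, and filter stages described in the Figure 5 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's pipeline: the three-dimensional embeddings, the `category_scores` field (a stand-in for a VLM's judgment of each mesh's category), the `multi_object` flag, the 0.5 threshold, and the `retrieve` function are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-in for a mesh database with precomputed embeddings (assumed values).
database = [
    {"id": "mesh_a", "embedding": (0.9, 0.1, 0.0),
     "category_scores": {"chair": 0.6}, "multi_object": False},
    {"id": "mesh_b", "embedding": (0.8, 0.3, 0.1),
     "category_scores": {"chair": 0.9}, "multi_object": False},
    {"id": "mesh_c", "embedding": (0.85, 0.2, 0.0),
     "category_scores": {"sofa": 0.8, "chair": 0.2}, "multi_object": False},
    {"id": "mesh_d", "embedding": (0.7, 0.2, 0.1),
     "category_scores": {"chair": 0.95}, "multi_object": True},
    {"id": "mesh_e", "embedding": (0.0, 0.1, 0.9),
     "category_scores": {"lamp": 0.9}, "multi_object": False},
]


def retrieve(query_embedding, category, db, k=4, min_score=0.5):
    # Stage 1: k nearest neighbours of the query embedding by cosine similarity.
    candidates = sorted(
        db, key=lambda m: cosine(query_embedding, m["embedding"]), reverse=True
    )[:k]
    # Stage 2: re-rank so meshes judged most likely to match the category
    # come first (sort is stable, so ties keep their similarity order).
    candidates.sort(
        key=lambda m: m["category_scores"].get(category, 0.0), reverse=True
    )
    # Stage 3: drop wrong-category meshes and meshes holding multiple objects.
    return [
        m for m in candidates
        if m["category_scores"].get(category, 0.0) >= min_score
        and not m["multi_object"]
    ]


results = retrieve((1.0, 0.1, 0.0), "chair", database)
print([m["id"] for m in results])
```

In this toy data, `mesh_d` scores highest on category but is discarded because it is flagged as containing multiple objects, and `mesh_c` is a close embedding match that is discarded for being the wrong category.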