Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes

Kelly O. Marshall, Omid Poursaeed, Sergiu Oprea, Amit Kumar, Anushrut Jignasu, Chinmay Hegde, Yilei Li, Rakesh Ranjan

TL;DR

The paper tackles the challenge of controlling both layout and style in 3D indoor scene generation from natural language prompts. It proposes Decorum, a fully language-based pipeline that converts prompts into densely grounded annotations (via Prompt2Ann) and then into CSS layouts (via Ann2Layout), while grounding furniture through DecoRate, a text-based retrieval module guided by multimodal LLMs. Key contributions include the two-stage NL-to-scene pipeline, the DecoRate retrieval with substantial improvements in Top-K accuracy and a new Text Fidelity Ranking (TFR) metric, and comprehensive evaluation on the 3D-FRONT benchmark demonstrating competitive or superior performance. The work enables flexible, user-driven design for digital environments and lays groundwork for broader language-grounded, style-aware scene synthesis.

Abstract

3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should be able to not only generate plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, only allowing text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method Decorum enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This enables us to harness recent advancements in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that using a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the benchmark 3D-FRONT dataset show that our methods achieve improvements over existing work in text-conditioned scene synthesis and object retrieval.

Paper Structure

This paper contains 26 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Illustration of the Decorum pipeline. We finetune LLaMA to obtain both Prompt2Annotation and Annotation2Layout models to convert from user prompts to CSS scene layouts in a two-stage process. Using this intermediate annotation representation decouples the generation of spatial and stylistic elements, allowing us to separately perform furniture selection using the tagged objects. Purple coloring indicates language model modules that we train using LoRA and green coloring shows our DecoRate furniture retrieval method which relies entirely on pretrained LLMs.
  • Figure 2: Illustration of the DecoRate coarse-to-fine rating system for text-based object retrieval. The top half shows our coarse CLIP-based candidate generation process. Below that is a visualization of our fine-grained rating system using LLaVA-NeXT. The dotted blue box contains our method for assigning probability scores to the text tokens (blue) conditioned on each input object’s visual tokens. These token probabilities are then summed together and added to the object prior probabilities (described in the paper's section on object priors) to rate each object.
  • Figure 3: Examples of prompt-conditioned scene generation for bedrooms (top) and living rooms (bottom) with Decorum. We condition on prompts taken from the test set generated by LLaMA (left) and out-of-distribution prompts in a different format (right). We show that our method can accurately generate 3D indoor scenes based on user prompts that specify both spatial and stylistic attributes.
  • Figure 4: Example of LayoutGPT text-conditioned output compared to Decorum for a sample text prompt. Because LayoutGPT does not incorporate information from the text prompt into its choice of objects, it cannot satisfy visual descriptions.
  • Figure 5: Example of Decorum pipeline applied to a sample input. For this sample prompt we show the model's predicted annotation which is used for layout generation and object selection. We then show the description generated for each object and the corresponding 3D object retrieved for this description. Finally, we include renderings of the final scene created from arranging the selected furniture.
  • ...and 3 more figures
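
The coarse-to-fine rating described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rate_object`, `coarse_to_fine`), the shortlist size `k`, and all numbers are hypothetical, and the sketch assumes that the "probabilities" being summed are log-probabilities. The CLIP and LLaVA-NeXT calls are not shown; their outputs are passed in as plain Python values.

```python
import math
from typing import Dict, List, Sequence


def rate_object(token_logprobs: Sequence[float], log_prior: float) -> float:
    """Fine-grained rating: sum the description's per-token log-probabilities
    (assumed to come from a multimodal LLM conditioned on the object's image)
    and add the object's log prior."""
    return sum(token_logprobs) + log_prior


def coarse_to_fine(
    coarse_scores: Dict[str, float],
    token_logprobs: Dict[str, Sequence[float]],
    log_priors: Dict[str, float],
    k: int,
) -> List[str]:
    """Return the k coarse candidates re-ranked by their fine-grained rating."""
    # Coarse stage: keep the k objects with the highest similarity scores
    # (assumed to be text-image similarities from a CLIP-style model).
    shortlist = sorted(coarse_scores, key=coarse_scores.get, reverse=True)[:k]
    # Fine stage: rate only the shortlisted objects, best rating first.
    ratings = {
        obj: rate_object(token_logprobs[obj], log_priors[obj]) for obj in shortlist
    }
    return sorted(shortlist, key=ratings.get, reverse=True)


# Toy inputs (all values are made up for illustration).
coarse = {"chair_a": 0.31, "chair_b": 0.29, "table_c": 0.10}
logprobs = {"chair_a": [-1.0, -2.0], "chair_b": [-0.5, -0.5], "table_c": [-0.1]}
priors = {
    "chair_a": math.log(0.5),
    "chair_b": math.log(0.25),
    "table_c": math.log(0.25),
}

ranking = coarse_to_fine(coarse, logprobs, priors, k=2)
print(ranking)
```

In this toy run, `table_c` is dropped by the coarse stage before it is ever rated, which shows the trade-off of the two-stage design: the expensive rating step only sees the shortlist, so a candidate missed by the coarse stage cannot be recovered.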