Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, Bo Dai

TL;DR

DirectLayout tackles open-ended 3D indoor scene synthesis by generating numerical layouts directly from text through a three-stage pipeline (BEV layout generation, lifting to 3D, and refinement). It uses Chain-of-Thought Activation and a CoT-Grounded Generative Layout Reward to instill spatial reasoning and improve generalization, and uses Iterative Asset-Layout Alignment to reconcile layouts with generated assets. Compared with baselines and ablations on 15 scene categories, the approach yields better semantic alignment, physical plausibility, and fine-grained control. The work advances embodied AI and digital content creation by enabling instruction-faithful, open-vocabulary scene synthesis, while acknowledging trade-offs in inference time and object-density scalability.

Abstract

Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using the generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design a CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization, and physical plausibility.
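To make the idea of a "numerical layout" and the three stages concrete, the following is a minimal sketch and not the authors' implementation. It assumes axis-aligned boxes in metres and ignores rotation; the names `BEVBox`, `Box3D`, `lift`, and `colliding_pairs` are hypothetical and introduced only for illustration. In DirectLayout the BEV boxes and the lifted heights are produced by LLM-based generators, whereas here they are written by hand, and the collision check stands in for the kind of physical-plausibility signal a refinement step could use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BEVBox:
    """Hypothetical stage-1 output: an object's footprint on the floor plane."""
    name: str
    x: float      # footprint centre along x (metres)
    y: float      # footprint centre along y (metres)
    width: float  # extent along x
    depth: float  # extent along y


@dataclass(frozen=True)
class Box3D:
    """Hypothetical stage-2 output: the footprint plus vertical placement."""
    name: str
    x: float
    y: float
    z: float       # height of the bottom face above the floor
    width: float
    depth: float
    height: float  # vertical extent


def lift(box: BEVBox, height: float, z: float = 0.0) -> Box3D:
    """Lift a BEV footprint into 3D by attaching a bottom height and an extent."""
    return Box3D(box.name, box.x, box.y, z, box.width, box.depth, height)


def footprints_overlap(a: Box3D, b: Box3D) -> bool:
    """True if the two axis-aligned footprints share floor area."""
    return (abs(a.x - b.x) * 2 < a.width + b.width
            and abs(a.y - b.y) * 2 < a.depth + b.depth)


def colliding_pairs(boxes: list[Box3D]) -> list[tuple[str, str]]:
    """Return name pairs whose 3D boxes interpenetrate (touching faces are allowed)."""
    pairs = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            vertical = a.z < b.z + b.height and b.z < a.z + a.height
            if vertical and footprints_overlap(a, b):
                pairs.append((a.name, b.name))
    return pairs


# A hand-written bedroom: the lamp sits on top of the nightstand (z = 0.5).
bed = lift(BEVBox("bed", 1.0, 1.0, 2.0, 1.6), height=0.5)
nightstand = lift(BEVBox("nightstand", 2.3, 1.0, 0.5, 0.5), height=0.5)
lamp = lift(BEVBox("lamp", 2.3, 1.0, 0.2, 0.2), height=0.4, z=0.5)
scene = [bed, nightstand, lamp]
collisions = colliding_pairs(scene)
```

In this sketch, a non-empty `collisions` list would be the cue to adjust placements. The paper instead refines layouts through evaluator feedback and in-context learning.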

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Our method synthesizes 3D indoor scenes from text descriptions via direct numerical layout generation, demonstrating strong performance in both instruction compliance and physical plausibility. In contrast, existing methods often place or size objects inappropriately, as highlighted by the red circles. Furthermore, they struggle to identify all the entities in fine-grained user instructions, resulting in object omissions, indicated by the yellow circles. All methods share the same assets, generated by a 3D object generation method, to ensure a fair comparison.
  • Figure 2: Overview of our method. Training Stage: BEV Layout Generator is first fine-tuned on BEV layouts curated from the 3D-Front dataset, guided by CoT annotations generated by GPT-4o. Subsequently, it is further optimized through DPO, leveraging CoT-Grounded Generative Layout Reward derived from Spatial Evaluator (VLM) and Quantitative Evaluator (reasoning LLM). Inference Stage: Given a text prompt, BEV Layout Generator produces a 2D layout, which is then lifted to a 3D layout by 3D Layout Generator. Iterative Asset-Layout Alignment refines the 3D scene by using the Spatial Evaluator and Quantitative Evaluator to provide feedback to the layout generators, ensuring consistency between the layout and generated 3D assets from an object generator.
  • Figure 3: Qualitative comparisons with scene synthesis methods. We compared our generated scenes with existing methods across various scene types and coarse-to-fine prompt granularities. Our results demonstrate a better alignment with the text descriptions across different prompt granularities and scene types.
  • Figure 4: Ablation study results. The experiment validates the effectiveness of task decomposition and of the proposed CoT-Grounded Generative Layout Reward.
  • Figure 5: More generated results based on language instructions with different granularities.
  • ...and 1 more figure
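The Figure 2 caption states that the BEV Layout Generator is further optimized through DPO (Direct Preference Optimization). As a rough illustration, the sketch below computes the standard DPO loss for a single preference pair. It assumes that the "chosen" and "rejected" layouts have already been ranked, for example by a layout reward, and that their summed token log-probabilities under the trained policy and a frozen reference model are available. The function name `dpo_loss`, its argument names, and the value `beta=0.1` are illustrative assumptions, not details taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the difference of
    the policy-vs-reference log-ratios for the chosen and rejected outputs.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before any training the policy equals the reference, so the margin is zero.
initial_loss = dpo_loss(-42.0, -40.0, -42.0, -40.0)

# After the policy shifts probability toward the chosen layout, the loss drops.
improved_loss = dpo_loss(-38.0, -44.0, -42.0, -40.0)
```

With identical policy and reference the loss is log 2, and it decreases as the policy assigns relatively more probability to the preferred layout than the reference does.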