LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation

Yang Zhou, Zongjin He, Qixuan Li, Chao Wang

TL;DR

LayoutDreamer tackles the challenge of text-to-3D compositional scene generation by introducing a physics-guided pipeline that uses 3D Gaussian Splatting and directed scene graphs to initialize and arrange objects. It couples a dynamic camera roaming strategy with a two-stage, physics-informed layout energy function to enforce realism, non-penetration, and stable object relationships. The method achieves state-of-the-art performance on the T3Bench multiple-objects metric and offers scalable, editable scene layouts suitable for rapid expansion and practical use. This work advances controllability and physical plausibility in text-driven 3D scene synthesis, enabling more reliable production-ready assets and interactive editing workflows.

Abstract

Recently, the field of text-guided 3D scene generation has garnered significant attention. High-quality generation that aligns with physical realism and high controllability is crucial for practical 3D scene applications. However, existing methods face fundamental limitations: (i) difficulty capturing complex relationships between multiple objects described in the text, (ii) inability to generate physically plausible scene layouts, and (iii) lack of controllability and extensibility in compositional scenes. In this paper, we introduce LayoutDreamer, a framework that leverages 3D Gaussian Splatting (3DGS) to facilitate high-quality, physically consistent compositional scene generation guided by text. Specifically, given a text prompt, we convert it into a directed scene graph and adaptively adjust the density and layout of the initial compositional 3D Gaussians. Subsequently, dynamic camera adjustments are made based on the training focal point to ensure entity-level generation quality. Finally, by extracting directed dependencies from the scene graph, we tailor physical and layout energy to ensure both realism and flexibility. Comprehensive experiments demonstrate that LayoutDreamer outperforms other methods in compositional scene generation quality and semantic alignment. Specifically, it achieves state-of-the-art (SOTA) performance in the multiple objects generation metric of T3Bench.
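
To make the idea of a directed scene graph with directed dependencies more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each relation parsed from the prompt can be written as a (subject, predicate, anchor) triple in which the subject depends on the anchor; the names `Relation` and `placement_order` are hypothetical. The prompt used is one of the validation prompts from the paper's Figure 5. The sketch only derives an order in which dependent objects come after the objects they depend on, which is one plausible way such dependencies could be consumed.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Relation:
    """One directed edge of a scene graph (hypothetical structure)."""
    subject: str    # object whose placement depends on the anchor
    predicate: str  # spatial relation, e.g. "on" or "beside"
    anchor: str     # object the subject depends on


def placement_order(relations):
    """Return object names so that every anchor precedes its dependents."""
    deps = {}
    for r in relations:
        # TopologicalSorter expects a mapping: node -> set of predecessors.
        deps.setdefault(r.subject, set()).add(r.anchor)
        deps.setdefault(r.anchor, set())
    return list(TopologicalSorter(deps).static_order())


# "a lamp on a table, with a bed beside the table"
relations = [
    Relation("lamp", "on", "table"),
    Relation("bed", "beside", "table"),
]
order = placement_order(relations)
print(order)  # the table comes first; lamp and bed follow in either order
```

Using a topological order means a cyclic set of relations (for example, A on B and B on A) raises `graphlib.CycleError`, which gives an early signal that the parsed relations are inconsistent.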

Paper Structure

This paper contains 24 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overall pipeline of LayoutDreamer. Given a text prompt, LayoutDreamer converts it into a scene graph, identifying node objects and dependencies. It integrates the size and layout pool to generate initial compositional 3D Gaussians and employs a dynamic camera strategy for entity-level optimization. Energy terms are retrieved from the layout pool based on the scene graph to optimize two-stage layout energy under the principles of physics.
  • Figure 2: Comparisons with closed-source compositional text-to-3D methods. LayoutDreamer emphasizes the layout based on an understanding of physical principles.
  • Figure 3: Qualitative comparisons between LayoutDreamer and other text-to-3D methods. The prompts are derived from the standard compositional scene prompts and the multiple objects tracking prompt set provided by T$^{3}$Bench. LayoutDreamer generates disentangled scenes using the same text prompts, with a focus on layout informed by physical principles.
  • Figure 4: Visual results of the ablation studies. Experiments validate the effectiveness of the three core modules, highlighting the critical roles of coarse-to-fine scene layout optimization and of entity optimization based on a dynamic camera roaming strategy in compositional scene generation.
  • Figure 5: Validation cases of physical energy terms. The text prompts for Case 1, Case 2, Case 3 are as follows: "a clock hangs on a moldy cabinet", "a lamp on a table, with a bed beside the table" and "a bicycle leans against a table".
  • ...and 1 more figure