DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis

Jialin Gao, Donghao Zhou, Mingjian Liang, Lihao Liu, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng

TL;DR

DisCo-Layout addresses generalization challenges in 3D indoor layout synthesis by refining semantic and physical aspects separately with two dedicated tools, and by coordinating those tools through a three-agent framework (planner, designer, evaluator) built on a vision-language model (VLM). The planner derives placement rules and groups assets, the designer proposes initial poses, and the evaluator uses visual question answering (VQA) to decide when refinement is needed. The Semantic Refinement Tool (SRT) corrects high-level object relationships, and the Physical Refinement Tool (PRT) resolves spatial collisions through a grid-matching algorithm. Empirical results show state-of-the-art performance with zero physical violations and high semantic coherence across diverse scenes and open-domain assets, outperforming baselines such as LayoutGPT, Holodeck, and LayoutVLM. The work highlights the value of modular, tool-based refinement and open-domain generalization for scalable, realistic 3D indoor layout synthesis; the authors state that code will be released.

Abstract

3D indoor layout synthesis is crucial for creating virtual environments. Traditional methods struggle with generalization due to fixed datasets. While recent LLM and VLM-based approaches offer improved semantic richness, they often lack robust and flexible refinement, resulting in suboptimal layouts. We develop DisCo-Layout, a novel framework that disentangles and coordinates physical and semantic refinement. For independent refinement, our Semantic Refinement Tool (SRT) corrects abstract object relationships, while the Physical Refinement Tool (PRT) resolves concrete spatial issues via a grid-matching algorithm. For collaborative refinement, a multi-agent framework intelligently orchestrates these tools, featuring a planner for placement rules, a designer for initial layouts, and an evaluator for assessment. Experiments demonstrate DisCo-Layout's state-of-the-art performance, generating realistic, coherent, and generalizable 3D indoor layouts. Our code will be publicly available.
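
The abstract says only that PRT resolves spatial issues "via a grid-matching algorithm" and gives no further detail here. As a rough illustration of what a grid-based collision fix can look like, the following sketch discretizes the floor into cells and moves a colliding object to the nearest collision-free cell. Everything in it is an assumption made for illustration and not the authors' algorithm: the `Box` type, the `nearest_free_pose` function, the axis-aligned 2D footprints, and the 0.25 m cell size.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned floor footprint (hypothetical type, metres)."""
    x: float  # minimum-x corner
    y: float  # minimum-y corner
    w: float  # extent along x
    d: float  # extent along y


def overlaps(a: Box, b: Box) -> bool:
    """True if the two footprints share interior area (touching edges are allowed)."""
    return a.x < b.x + b.w and b.x < a.x + a.w and a.y < b.y + b.d and b.y < a.y + a.d


def nearest_free_pose(target: Box, placed: List[Box],
                      room_w: float, room_d: float,
                      cell: float = 0.25) -> Optional[Box]:
    """Scan grid-aligned candidate positions inside the room and return the
    collision-free one closest to the target's proposed position, or None."""
    if target.w > room_w or target.d > room_d:
        return None  # the object cannot fit in the room at all
    nx = int((room_w - target.w) / cell) + 1
    ny = int((room_d - target.d) / cell) + 1
    best, best_dist = None, float("inf")
    for i in range(nx):
        for j in range(ny):
            cand = Box(i * cell, j * cell, target.w, target.d)
            if any(overlaps(cand, p) for p in placed):
                continue
            dist = (cand.x - target.x) ** 2 + (cand.y - target.y) ** 2
            if dist < best_dist:
                best, best_dist = cand, dist
    return best


# Toy example: a table proposed on top of a sofa in a 4 m x 4 m room.
sofa = Box(0.0, 0.0, 2.0, 1.0)
table = Box(1.0, 0.5, 1.0, 1.0)
fixed = nearest_free_pose(table, [sofa], room_w=4.0, room_d=4.0)
```

In this toy case the table is pushed just clear of the sofa along the shorter direction. A real system would also have to handle rotation, walls, doors, and 3D stacking, none of which this sketch attempts.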

Paper Structure

This paper contains 21 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Architecture comparison of LLM/VLM-based methods for 3D indoor scene synthesis. (a) Single-step methods directly predict layouts without refinement, leading to inconsistent results in complex scenarios. (b) Coupled refinement methods integrate physical and semantic refinement in a coupled process, causing interference and limiting flexibility. (c) The proposed DisCo-Layout introduces a multi-agent framework that disentangles and coordinates semantic and physical refinement, enabling iterative and adaptive synthesis through collaboration between planner, designer, and evaluator agents.
  • Figure 2: Pipeline of DisCo-Layout. Our method employs a multi-agent framework consisting of a planner, designer, and evaluator (Section \ref{sec:agent}). The planner determines placement rules based on asset properties, the designer predicts layout configurations for asset groups, and the evaluator assesses and refines the scene using a tool-use approach. Further, the refinement process is decoupled into the Semantic Refinement Tool (SRT) and Physical Refinement Tool (PRT), ensuring both contextual coherence and spatial consistency in the synthesized 3D indoor layouts (Section \ref{sec:refine}).
  • Figure 3: Visualization of the correction effects of our SRT and PRT. The top row (a) showcases the result of SRT, which focuses on semantic coherence. The bottom row (b) demonstrates the result of PRT, which enforces physical plausibility.
  • Figure 4: Qualitative comparison. Using the same prompts and assets as inputs, DisCo-Layout generates semantically and physically plausible layouts that accurately reflect the spatial intent of the given prompts.
  • Figure 5: Group-by-group visualization of our synthesized layout. DisCo-Layout progressively places three groups of assets for a living room, refining semantic and physical errors immediately at each stage.
  • ...and 1 more figures
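
The captions for Figures 2 and 5 describe an evaluator that assesses each placed group and invokes a refinement tool when needed. The loop below is a minimal sketch of that coordination pattern, not the paper's implementation. The `refine_group` function, the verdict strings `"ok"`, `"semantic"`, and `"physical"`, the `max_rounds` cap, and the toy evaluator and tools are all hypothetical stand-ins; in the paper the evaluator is a VLM using VQA and the tools are SRT and PRT.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layout encoding: asset name -> (x, y, rotation in degrees).
Layout = Dict[str, Tuple[float, float, float]]


def refine_group(layout: Layout,
                 evaluate: Callable[[Layout], str],
                 tools: Dict[str, Callable[[Layout], Layout]],
                 max_rounds: int = 3) -> Tuple[Layout, List[str]]:
    """Ask the evaluator for a verdict and dispatch to the matching tool
    until the layout is accepted or the round budget is exhausted."""
    history: List[str] = []
    for _ in range(max_rounds):
        verdict = evaluate(layout)
        if verdict == "ok":
            break
        layout = tools[verdict](layout)  # apply only the tool the verdict names
        history.append(verdict)
    return layout, history


# Toy stand-ins for the evaluator and the two tools (assumptions only).
def toy_evaluate(layout: Layout) -> str:
    if any(x < 0 for x, _, _ in layout.values()):
        return "physical"   # an asset lies outside the room
    if layout["chair"][2] != 180.0:
        return "semantic"   # the chair does not face the desk
    return "ok"


def toy_physical_fix(layout: Layout) -> Layout:
    return {k: (max(x, 0.0), y, r) for k, (x, y, r) in layout.items()}


def toy_semantic_fix(layout: Layout) -> Layout:
    fixed = dict(layout)
    x, y, _ = fixed["chair"]
    fixed["chair"] = (x, y, 180.0)
    return fixed


final, history = refine_group(
    {"chair": (-1.0, 2.0, 0.0)},
    toy_evaluate,
    {"physical": toy_physical_fix, "semantic": toy_semantic_fix},
)
```

Keeping the two tools behind separate verdicts mirrors the disentangling idea: each round changes only one aspect of the layout, and the evaluator re-checks the result before anything else is touched.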