SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang

TL;DR

SceneWeaver tackles the challenge of open-ended, physically plausible 3D indoor scene synthesis for embodied AI by introducing a reflective, agentic framework. It unifies diverse generation tools through a standardized interface and governs their use with a reason-act-reflect loop, powered by a physics-aware executor. The approach achieves state-of-the-art performance on both common and open-vocabulary room types, with zero collisions and boundary violations and strong instruction-following performance, as shown by quantitative metrics and human studies. Comprehensive ablations demonstrate the necessity of iterative reflection, tool diversity, and physics-based refinement for high-quality, configurable scene synthesis. The work advances towards general-purpose, controllable 3D environment generation and offers a scalable framework for integrating future scene-generation tools and assets.

Abstract

Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.
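
The closed-loop reason-act-reflect design described above can be illustrated with a toy sketch. This is not the authors' implementation: the `Scene` class, the three stand-in tools, the scoring heuristics in `reflect`, and the rule-based `plan` function (which takes the place of the language model-based planner) are all hypothetical names and simplifications introduced here for illustration. The sketch only shows the control flow: evaluate the scene, pick a tool from a registry with a uniform interface, apply it, and repeat.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Scene:
    """Toy scene state (hypothetical): object names plus unresolved collisions."""
    objects: List[str] = field(default_factory=list)
    collisions: int = 0


# Standardized tool interface (assumed): every tool maps a scene to a new scene.
Tool = Callable[[Scene], Scene]


def init_layout(scene: Scene) -> Scene:
    # Stand-in for a generative initializer; it leaves one collision behind.
    return Scene(objects=scene.objects + ["bed", "shelf"], collisions=1)


def add_details(scene: Scene) -> Scene:
    # Stand-in for a tool that places small objects on existing furniture.
    return Scene(objects=scene.objects + ["book", "lamp"],
                 collisions=scene.collisions)


def physics_fix(scene: Scene) -> Scene:
    # Stand-in for physics-based refinement that resolves all collisions.
    return Scene(objects=list(scene.objects), collisions=0)


# Extensible registry: adding a new tool only requires a new entry here.
TOOLS: Dict[str, Tool] = {
    "init_layout": init_layout,
    "add_details": add_details,
    "physics_fix": physics_fix,
}


def reflect(scene: Scene) -> Dict[str, float]:
    """Reflect: score the current scene with toy heuristics in [0, 1]."""
    return {
        "physical": 1.0 if scene.collisions == 0 else 0.0,
        "detail": min(len(scene.objects) / 4.0, 1.0),
    }


def plan(scene: Scene, scores: Dict[str, float]) -> Optional[str]:
    """Reason: choose the next tool name, or None when the scene is good enough."""
    if not scene.objects:
        return "init_layout"
    if scores["physical"] < 1.0:
        return "physics_fix"
    if scores["detail"] < 1.0:
        return "add_details"
    return None


def synthesize(max_steps: int = 5) -> Tuple[Scene, List[str]]:
    """Run the loop until the planner stops or the step budget is exhausted."""
    scene, history = Scene(), []
    for _ in range(max_steps):
        scores = reflect(scene)            # reflect on the current state
        tool_name = plan(scene, scores)    # reason about what to fix next
        if tool_name is None:
            break
        scene = TOOLS[tool_name](scene)    # act by invoking the chosen tool
        history.append(tool_name)
    return scene, history


if __name__ == "__main__":
    final_scene, used_tools = synthesize()
    print(used_tools, final_scene)
```

In this toy run the loop first initializes a layout, then repairs the collision it introduced, then adds detail, and stops once both scores reach 1.0. In the framework described in the abstract, the planner is a language model and the self-evaluation also covers visual realism and semantic alignment with the user input, neither of which is modeled here.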

Paper Structure

This paper contains 133 sections, 22 figures, 24 tables.

Figures (22)

  • Figure 1: Overview of SceneWeaver, a reflective agentic framework built on standardized and extensible tool interfaces that unifies the strengths of existing scene synthesis methods to produce visually realistic, physically plausible, instruction-aligned 3D scenes.
  • Figure 2: The SceneWeaver pipeline. Following a reason–act–reflect paradigm, SceneWeaver iteratively refines scenes by integrating the strengths of diverse scene synthesis tools.
  • Figure 3: A visualization of standardized tool interfaces and the reflective planning process. The self-reflective planner leverages diverse tools to first correct the misoriented laundry machine and then enhance scene details by adding small objects to the shelf (right).
  • Figure 4: Qualitative comparison between SceneWeaver and existing methods on both common and open-vocabulary room types. SceneWeaver produces scenes with improved visual realism and finer-grained detail compared to prior methods.
  • Figure 5: Iterative refinement in SceneWeaver given complex user queries. SceneWeaver progressively incorporates detailed elements specified in the user instruction, demonstrating its ability to iteratively refine and generate high-quality, instruction-aligned 3D scenes (best viewed with zoom-in).
  • ...and 17 more figures