SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

TL;DR

SceneSmith addresses the gap between the diversity of real indoor spaces and the sparse environments available in existing simulators with a hierarchical, agentic pipeline that jointly generates simulation-ready assets and layouts from natural-language prompts. At each level of the scene hierarchy, a multi-agent loop (Designer, Critic, Orchestrator) combines text-to-3D asset synthesis, articulated-object retrieval, and physics-aware placement to produce dense, physically plausible environments. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation; it outperforms baselines in human judgments and automated SceneEval metrics, and supports an end-to-end robot policy evaluation pipeline from task descriptions to automatic success verification. This enables scalable training and evaluation of home-robot policies in varied, cluttered, and manipulable indoor scenes.

Abstract

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages (from architectural layout to furniture placement to small object population), each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

Paper Structure (112 sections, 4 equations, 32 figures, 17 tables)

Figures (32)

  • Figure 1: Fully automated text-to-scene generation. This entire community center was generated by SceneSmith without any human intervention, from a single 151-word text prompt (full prompt in Appendix \ref{app:prompts_house}). Beyond explicitly specified elements, SceneSmith places additional objects from inferred contextual information, such as ping pong paddles and balls placed near a ping pong table. Objects are generated on-demand, are fully separable (non-composite), and include estimated physical properties, enabling direct interaction within a simulation. The resulting scenes are immediately usable in arbitrary physics simulators (robots added for demonstration).
  • Figure 2: SceneSmith's hierarchical scene construction pipeline. A scene prompt $\mathcal{T}$ is processed by a layout agent to generate architectural geometry for $M$ rooms. Each room is then independently populated through furniture, wall-mounted, and ceiling-mounted stages using room-specific prompts $\mathcal{T}_j$. In each room, $K_j$ supporting entities subsequently form additional branches populated with manipulable objects using entity-specific prompts $\mathcal{T}_{j,k}$. Colored highlights indicate objects added at each stage. Each stage (colored triangle) is implemented as an agentic interaction between a Designer, Critic, and Orchestrator. Stacked frames indicate parallel branches.
  • Figure 3: Text-to-3D asset generation pipeline. Given an object description, we generate an image, segment the foreground, and reconstruct a textured 3D mesh. The mesh is augmented with collision geometry (gray convex pieces) and physical properties estimated by a VLM, including mass, center of mass, friction, and inertia (blue ellipsoid). The mesh is also scaled to target dimensions.
  • Figure 4: Qualitative comparison with HSM and Holodeck, the two strongest baselines in our user study. SceneSmith produces denser scenes that better satisfy prompt requirements. See Appendix \ref{app:qualitative_results} for additional examples.
  • Figure 5: Designer-critic iteration during furniture placement. Each column shows a different room evolving through two rounds of critic feedback and designer refinement. Top row: Initial designs after the first designer pass, with scene prompts shown above. Middle row: Scenes after the first critique-and-improve cycle. Bottom row: Scenes after the second cycle. Text annotations describe the changes made at each step. The bedroom (left) progressively improves bunk bed placement. The dining room (center) refines chair orientation and adds furnishings. The pharmacy (right) illustrates checkpoint rollback: when the designer's additions degrade the critic score, the orchestrator resets to the previous checkpoint and prompts a different approach.
  • ...and 27 more figures
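
The checkpoint rollback described in the Figure 5 caption can be illustrated with a small sketch. This is not SceneSmith's implementation: the function `refine_with_rollback`, the list-of-strings scene representation, the toy designer and critic, and the scoring rule are all hypothetical stand-ins chosen only to show the control flow (critique, refine, and revert to the previous checkpoint when the critic score drops). In the paper the designer and critic are VLM agents; here they are plain Python callables.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a scene is just a list of object names.
Scene = List[str]


def refine_with_rollback(
    scene: Scene,
    designer: Callable[[Scene, str], Scene],
    critic: Callable[[Scene], Tuple[float, str]],
    rounds: int = 2,
) -> Tuple[Scene, float]:
    """Run critique-and-improve cycles, reverting any round that lowers the score."""
    checkpoint = list(scene)
    best_score, feedback = critic(checkpoint)
    for _ in range(rounds):
        # The designer works on a copy so the checkpoint stays intact.
        candidate = designer(list(checkpoint), feedback)
        score, new_feedback = critic(candidate)
        if score >= best_score:
            # Accept the refinement and make it the new checkpoint.
            checkpoint, best_score, feedback = candidate, score, new_feedback
        else:
            # Rollback: discard the candidate and prompt a different approach.
            feedback = (
                "Previous attempt lowered the score; try a different approach. "
                + feedback
            )
    return checkpoint, best_score


def toy_critic(scene: Scene) -> Tuple[float, str]:
    # Assumed scoring rule: reward variety, penalize an obstructing object.
    score = len(set(scene)) - 2.0 * scene.count("blocking_shelf")
    return score, "add useful furniture"


def toy_designer(scene: Scene, feedback: str) -> Scene:
    # First attempt adds a bad object; after a rollback hint it adds a good one.
    if "different approach" in feedback:
        return scene + ["chair"]
    return scene + ["blocking_shelf"]


if __name__ == "__main__":
    final_scene, final_score = refine_with_rollback(
        ["table"], toy_designer, toy_critic, rounds=2
    )
    print(final_scene, final_score)
```

In this toy run the first designer pass degrades the score and is discarded, and the second pass, prompted to try something different, is accepted. Keeping the checkpoint separate from the candidate is what makes the revert a simple no-op on the stored state.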