PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI

Yandan Yang, Baoxiong Jia, Peiyuan Zhi, Siyuan Huang

TL;DR

PhyScene addresses the challenge of generating indoor scenes that are both realistic and physically plausible for embodied AI agents. It is a diffusion-based pipeline conditioned on floor plans and augmented with three guidance functions for physics and interactivity: collision avoidance, room-layout alignment, and reachability. The guidance is integrated into the sampling process via $p(O=1|\mathbf{x}_t, \mathcal{F}) \propto p_\theta(\mathbf{x}_0|\mathcal{F}) \exp(\varphi(\mathbf{x}_t, \mathcal{F}))$ and related gradients. Each object is represented by a semantic label, size, pose, and a 32-dim shape feature $\mathbf{f}_i$, which enables retrieval of assets, including articulated ones, from cross-dataset collections such as 3D-FUTURE and GAPartNet. Experiments on 3D-FRONT-based data show that PhyScene achieves state-of-the-art results on traditional perceptual metrics (FID, CKL, SCA, KID) and substantially improves physical plausibility (lower object- and scene-level collision rates, higher reachability) over baselines. This work enables scalable, physically credible indoor scene synthesis for embodied AI, with the potential to accelerate skill acquisition in simulated environments.
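To make the role of a guidance function such as $\varphi$ concrete, the following is a minimal sketch of collision guidance, not the authors' implementation. It assumes axis-aligned 2D footprints `(cx, cy, w, h)`, a total-overlap-area penalty, and a finite-difference gradient. All function names and the `scale` parameter are hypothetical; the paper's actual collision term and step size may differ.

```python
from itertools import combinations

def overlap_area(a, b):
    """Overlap area of two axis-aligned boxes given as (cx, cy, w, h)."""
    ox = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    oy = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    return max(0.0, ox) * max(0.0, oy)

def phi_collision(boxes):
    """Assumed guidance objective: negative total pairwise overlap area."""
    return -sum(overlap_area(a, b) for a, b in combinations(boxes, 2))

def grad_phi_centers(boxes, eps=1e-4):
    """Central finite-difference gradient of phi w.r.t. each box center."""
    grads = []
    for i in range(len(boxes)):
        g = []
        for k in (0, 1):  # k = 0 perturbs cx, k = 1 perturbs cy
            plus = [list(b) for b in boxes]
            minus = [list(b) for b in boxes]
            plus[i][k] += eps
            minus[i][k] -= eps
            g.append((phi_collision(plus) - phi_collision(minus)) / (2 * eps))
        grads.append(tuple(g))
    return grads

def guided_shift(boxes, scale=0.5):
    """Move each center along grad(phi); a stand-in for shifting the denoising mean."""
    grads = grad_phi_centers(boxes)
    return [(b[0] + scale * g[0], b[1] + scale * g[1], b[2], b[3])
            for b, g in zip(boxes, grads)]

# Two overlapping 2 x 2 footprints.
boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.5, 2.0, 2.0)]
before = -phi_collision(boxes)                # total overlap before guidance
after = -phi_collision(guided_shift(boxes))   # total overlap after one shift
```

In a guided sampler, a shift of this kind would be applied at each denoising step alongside the learned update, so the layout prior and the physical constraint are traded off rather than enforced in a single jump.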

Abstract

With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: http://physcene.github.io.

Paper Structure (34 sections, 13 equations, 17 figures, 6 tables, 2 algorithms)

Figures (17)

  • Figure 1: Illustration of PhyScene, a physically interactable scene synthesis method that generates interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents.
  • Figure 2: Overview of PhyScene. We leverage diffusion models for capturing scene layout distributions and apply three distinct guidance functions for improving the physical plausibility and interactivity of generated scenes.
  • Figure 3: Comparison of floor-plan-conditioned scene synthesis across PhyScene, ATISS, and DiffuScene. The red, purple, and blue boxes highlight collisions between objects, objects outside the floor plan, and areas unreachable by the embodied agent, respectively.
  • Figure 4: Generated scenes with articulated objects. We visualize the opening sequence of articulated objects (left) and the generated scenes with texture (right).
  • Figure 5: Ablation on guidance. Results of different guidance functions under floor-plan conditioning. For each guidance function, we show four generated scenes (four columns). The first row shows results without guidance, with constraint violations marked in red boxes. The second row shows the improvement with guidance, marked in green boxes.
  • ...and 12 more figures
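
The "unreachable areas" highlighted in Figure 3 can be illustrated with a simple occupancy-grid check. The sketch below is an assumption-laden illustration rather than the paper's reachability metric: it assumes a 2D grid in which `0` marks free floor and `1` marks space occupied by furniture or lying outside the floor plan, treats the agent as a single cell, and uses 4-connected moves. The function name `reachable_ratio` is hypothetical.

```python
from collections import deque

def reachable_ratio(grid, start):
    """Fraction of free cells (0) reachable from `start` via 4-connected moves."""
    rows, cols = len(grid), len(grid[0])
    free = sum(cell == 0 for row in grid for cell in row)
    if free == 0 or grid[start[0]][start[1]] != 0:
        return 0.0
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over free cells
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen) / free

# A column of occupied cells splits the room, so the right strip is unreachable.
blocked = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]
# Removing one obstacle cell opens a passage to the right strip.
opened = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
ratio_blocked = reachable_ratio(blocked, (0, 0))
ratio_opened = reachable_ratio(opened, (0, 0))
```

A more faithful check would first dilate obstacles by the agent's footprint, so that passages narrower than the agent count as blocked.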