PAT3D: Physics-Augmented Text-to-3D Scene Generation

Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen, Lyuhao Chen, Beijia Lu, Taku Komura, Yuan Liu, Jun-Yan Zhu, Minchen Li

TL;DR

PAT3D tackles the problem of generating realistic, editable 3D scenes from natural language by tightly integrating vision-language reasoning with differentiable physics. The method introduces a physics-aware initialization and a simulation-in-the-loop optimization that enforces non-interpenetration, gravity-driven stability, and semantic fidelity to the prompt. Empirical results show superior physical plausibility, semantic alignment, and visual quality compared with prior approaches, and demonstrate practical usefulness for scene editing and robotic manipulation. This work advances physically grounded, controllable 3D scene generation and provides simulation-ready assets for downstream tasks.

Abstract

We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data will be released upon acceptance.
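The step from a hierarchical scene tree to intersection-free initial conditions can be illustrated with a toy sketch. This is not the authors' implementation: `SceneNode`, `initial_heights`, the `gap` parameter, and the example object sizes are all hypothetical. The sketch also assumes every relation in the tree is "rests on top of" and only computes vertical extents. Each object is placed slightly above its support, so a simulator could then let it settle under gravity without starting from an interpenetrating state.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """Hypothetical scene-tree node: an object plus the objects resting on it."""
    name: str
    height: float  # vertical extent of the object's bounding box, in meters
    children: list["SceneNode"] = field(default_factory=list)


def initial_heights(root: SceneNode, ground_z: float = 0.0, gap: float = 0.01):
    """Return {name: (z_bottom, z_top)} with each object placed `gap` above its support.

    Only the vertical axis is handled; horizontal placement of siblings is
    outside the scope of this sketch.
    """
    placement = {}

    def place(node: SceneNode, support_z: float) -> None:
        z_bottom = support_z + gap          # leave a small clearance above the support
        z_top = z_bottom + node.height
        placement[node.name] = (z_bottom, z_top)
        for child in node.children:         # children rest on top of this node
            place(child, z_top)

    place(root, ground_z)
    return placement


# Assumed example scene: a book and a pen holder on a table.
scene = SceneNode("table", 0.75, [
    SceneNode("book", 0.05),
    SceneNode("pen_holder", 0.10),
])
placement = initial_heights(scene)
for name, (z_bottom, z_top) in placement.items():
    print(f"{name}: bottom={z_bottom:.2f}, top={z_top:.2f}")
```

Starting each object with a small clearance rather than in exact contact is one simple way to avoid initial overlaps; the clearance closes once the objects are dropped under gravity.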

Paper Structure

This paper contains 39 sections, 8 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: PAT3D is the first text-to-3D scene generation framework that produces simulation-ready and intersection-free results. The left column shows results from direct depth-based arrangements, which suffer from object interpenetrations (top) and collapse under simulation due to inconsistent layouts (bottom). The middle column presents PAT3D results, where physically valid layouts remain stable under simulation. These high-quality scenes are immediately usable for downstream applications, including scene editing and robotic manipulation (right).
  • Figure 2: Overview of our text-to-3D scene generation pipeline. (a) Given an input text, a reference image is first generated to capture spatial relations among objects, from which 3D assets are generated using vision foundation models, and a scene tree is extracted using a VLM. (b) Assets are arranged into an initial layout using 3D priors from monocular depth estimation (left), then refined with the scene tree to produce an intersection-free configuration for simulation (right). (c) Forward simulation ensures physical plausibility but may distort semantics (left). We address this with simulation-in-the-loop optimization, enforcing semantic consistency and physical validity (right).
  • Figure 3: Comparison to baseline methods. The scenes are generated from our text prompts. OOM indicates out of memory.
  • Figure 4: Scene editing. We demonstrate the equilibrium state after addition and deletion operations: (a) initial scene, (b) deleting a book at the bottom, (c) deleting the pen holder, (d) adding a book on top.
  • Figure 5: Policy evaluation for robotic manipulation. Examples of a successful grasp and a failed grasp, in which the attempted action causes objects to topple.
  • ...and 7 more figures