PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation

Qixuan Li, Chao Wang, Zongjin He, Yan Peng

TL;DR

PhiP-G tackles the challenge of assembling physically plausible, semantically accurate 3D scenes from complex textual prompts by combining LLM-driven scene graph extraction, a fast 2D-to-3D asset pipeline based on 3D Gaussian Splatting, and a world-model-guided two-stage layout. The approach requires no task-specific training and enforces physical consistency through a Blender-based physical pool, a relationship classifier, magnet-guided layout refinement, and a visual supervision loop. Empirical results show state-of-the-art CLIP semantic consistency and competitive T$^3$Bench quality with ~24x faster generation, supported by ablation and user studies. The work highlights the viability of multi-agent, physics-aware generation for scalable, high-quality compositional 3D scenes.

Abstract

Text-to-3D asset generation has advanced significantly under the supervision of 2D diffusion priors. However, when dealing with compositional scenes, existing methods encounter several challenges: 1) failure to ensure that composite scene layouts comply with physical laws; 2) difficulty in accurately capturing the assets and relationships described in complex scene descriptions; and 3) limited autonomous asset generation capabilities among layout approaches that leverage large language models (LLMs). To address these challenges, we propose PhiP-G, a novel framework for compositional scene generation that integrates generation techniques with layout guidance based on a world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene description to generate a scene graph and integrates a multimodal 2D generation agent with a 3D Gaussian generation method for targeted asset creation. For the layout stage, PhiP-G employs a physical pool with adhesion capabilities and a visual supervision agent, which together form a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP-G significantly enhances the generation quality and physical rationality of compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by T$^3$Bench, and improves efficiency by 24x.
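
As a rough illustration of the scene-graph step mentioned in the abstract, the following minimal Python sketch stores assets as nodes and pairwise relationships as edges. The class names, field names, and the example relations ("on", "next to") are hypothetical and chosen only for illustration; the abstract does not specify the output format of the paper's LLM agent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Relation:
    """One directed edge of the scene graph (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Assets are nodes; relations are edges between asset names."""
    assets: List[str] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        # Register any asset not seen before, preserving insertion order.
        for name in (subject, obj):
            if name not in self.assets:
                self.assets.append(name)
        self.relations.append(Relation(subject, predicate, obj))

    def relations_of(self, name: str) -> List[Relation]:
        # All edges in which the asset takes part, as subject or object.
        return [r for r in self.relations if name in (r.subject, r.obj)]


# Example: "a lamp on a table, with a chair next to the table"
graph = SceneGraph()
graph.add_relation("lamp", "on", "table")
graph.add_relation("chair", "next to", "table")
```

In a sketch like this, the asset list would drive per-object generation, while the relations would feed the layout stage.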

Paper Structure

This paper contains 18 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: PhiP-G is dedicated to understanding complex scene descriptions and generating high-quality 3D compositional scenes while supporting the generation of special scene relationships. Compared with existing generation methods, our method demonstrates excellent physical consistency and the ability to handle special environmental relationships.
  • Figure 2: Overview of PhiP-G. Given a complex scene description, PhiP-G employs an LLM-based agent to perform text analysis and construct a scene graph. Graph-based 3D asset generation is carried out using a 2D generation agent and the 3D Gaussian model, where the 2D asset with the highest CLIP score is stored in the 2D retrieval library for future use. Subsequently, Blender serves as the foundational environment, where a world model consisting of the physical pool and a visual supervision agent enables coarse layout and iterative refinement. PhiP-G ensures improved semantic consistency and physical coherence in the generated scene.
  • Figure 3: Qualitative analysis of text-to-3D scene. Our method ensures consistency between textual descriptions and generated 3D scenes, while maintaining physical laws and handling special layout requirements.
  • Figure 4: Visualization depicting the ablation of key steps. This ablation experiment visually demonstrates the effectiveness and necessity of each layout module we design.
  • Figure 5: Example of agent AG-extractor prompt.
  • ...and 3 more figures
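
The Figure 2 caption notes that the 2D asset with the highest CLIP score is stored in a 2D retrieval library for future use. A minimal sketch of that selection step is given below, assuming the CLIP scores have already been computed elsewhere and are passed in as plain floats. The function name, the dictionary-based library, and the candidate identifiers are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def store_best_candidate(
    library: Dict[str, str],
    asset_name: str,
    scored_candidates: List[Tuple[str, float]],
) -> Tuple[str, float]:
    """Keep the candidate image with the highest score for one asset.

    scored_candidates holds (image_id, clip_score) pairs; the scores are
    assumed to be precomputed, so no CLIP model is called here.
    """
    if not scored_candidates:
        raise ValueError("no candidates to choose from")
    best_id, best_score = max(scored_candidates, key=lambda pair: pair[1])
    library[asset_name] = best_id  # later requests can reuse this asset
    return best_id, best_score


# Illustrative values only; these are not results from the paper.
library: Dict[str, str] = {}
best = store_best_candidate(
    library,
    "table",
    [("table_v1", 0.27), ("table_v2", 0.31), ("table_v3", 0.29)],
)
```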