PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation
Qixuan Li, Chao Wang, Zongjin He, Yan Peng
TL;DR
PhiP-G tackles the challenge of assembling physically plausible, semantically accurate 3D scenes from complex textual prompts by combining LLM-driven scene graph extraction, a fast 2D-to-3D asset pipeline based on 3D Gaussian Splatting, and a world-model-guided two-stage layout. The approach requires no task-specific training and enforces physical consistency through a Blender-based physical pool, a relationship classifier, magnet-guided layout refinement, and a visual supervision loop. The reported results show state-of-the-art CLIP scores for semantic consistency and T$^3$Bench generation quality on par with leading methods, with roughly 24x faster generation, supported by ablation and user studies. The work highlights the viability of multi-agent, physics-aware generation for scalable, high-quality compositional 3D scenes.
Abstract
Text-to-3D asset generation has advanced significantly under the supervision of 2D diffusion priors. However, existing methods face several challenges when handling compositional scenes: 1) they fail to ensure that composite scene layouts comply with physical laws; 2) they struggle to accurately capture the assets and relationships described in complex scene descriptions; and 3) layout approaches that leverage large language models (LLMs) have limited autonomous asset generation capabilities. To address these challenges, we propose PhiP-G, a novel framework for compositional scene generation that seamlessly integrates generation techniques with layout guidance based on a world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene description to generate a scene graph and integrates a multimodal 2D generation agent with a 3D Gaussian generation method for targeted asset creation. For the layout stage, PhiP-G employs a physical pool with adhesion capabilities and a visual supervision agent, which together form a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP-G significantly enhances the generation quality and physical rationality of compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by T$^3$Bench, and improves efficiency by 24x.
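To make the notion of a scene graph concrete, the following is a minimal illustrative sketch and not the PhiP-G implementation. It assumes a scene graph can be represented as assets (nodes) plus subject-predicate-target relations (edges), and that a layout stage might place each referenced asset before the assets that depend on it. The names `Relation`, `SceneGraph`, and `placement_order`, as well as the example prompt, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Relation:
    """One edge of the scene graph, e.g. ("lamp", "on", "desk")."""
    subject: str
    predicate: str
    target: str


@dataclass
class SceneGraph:
    """Hypothetical container: assets are nodes, relations are edges."""
    assets: set[str] = field(default_factory=set)
    relations: list[Relation] = field(default_factory=list)

    def add_relation(self, subject: str, predicate: str, target: str) -> None:
        # Register both endpoints as assets, then record the edge.
        self.assets.update((subject, target))
        self.relations.append(Relation(subject, predicate, target))

    def placement_order(self) -> list[str]:
        # Assumption: a relation's target (e.g. the desk) should be placed
        # before its subject (e.g. the lamp), so the target is treated as a
        # predecessor in a topological sort. Cyclic relations raise
        # graphlib.CycleError.
        sorter = TopologicalSorter({name: set() for name in self.assets})
        for rel in self.relations:
            sorter.add(rel.subject, rel.target)
        return list(sorter.static_order())


# Hypothetical prompt: "a lamp on a desk, with a chair beside the desk"
graph = SceneGraph()
graph.add_relation("lamp", "on", "desk")
graph.add_relation("chair", "beside", "desk")
order = graph.placement_order()
print(order)  # "desk" appears before "lamp" and "chair"
```

In a full pipeline, the relations would be extracted by an LLM agent rather than added by hand, and each predicate would map to a physical placement rule; both steps are outside the scope of this sketch.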
