Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization

Yao Wei, Martin Renqiang Min, George Vosselman, Li Erran Li, Michael Ying Yang

TL;DR

Planner3D introduces an end-to-end pipeline that augments scene graphs with CLIP and large language model-derived priors to form richer hierarchical graph representations. A unified graph encoder then guides a dual-branch decoder that jointly generates 3D layouts and object shapes, with an explicit IoU-based layout regularization to reduce collisions. A diffusion-based shape branch, conditioned on shape codes from the graph, produces high-fidelity geometries decoded through a pre-trained VQ-VAE. On SG-FRONT, Planner3D delivers superior scene-level fidelity and improved scene graph consistency compared to state-of-the-art baselines, validated by quantitative metrics and a user study. The approach offers a practical pathway to realistic, controllable multi-object 3D indoor scenes with potential for broader deployment in design and content creation environments.
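To make the collision-reducing idea concrete, the following is a minimal sketch of one way an IoU-based layout penalty could be computed. It is not the authors' formulation: the summary only states that the regularization is IoU-based, so the axis-aligned simplification (rotation angles are ignored), the `[cx, cy, cz, w, h, d]` box encoding, the mean over pairs, and the function names `pairwise_iou_3d` and `iou_penalty` are all assumptions made for illustration.

```python
import numpy as np


def box_bounds(boxes):
    """Convert (N, 6) boxes [cx, cy, cz, w, h, d] into min and max corners.

    The box encoding is an assumption made for this sketch.
    """
    centers, sizes = boxes[:, :3], boxes[:, 3:6]
    return centers - sizes / 2.0, centers + sizes / 2.0


def pairwise_iou_3d(boxes):
    """Return the (N, N) matrix of axis-aligned 3D IoU values."""
    mins, maxs = box_bounds(boxes)
    # Overlap extent along each axis for every pair (i, j), clipped at zero.
    inter_min = np.maximum(mins[:, None, :], mins[None, :, :])
    inter_max = np.minimum(maxs[:, None, :], maxs[None, :, :])
    inter = np.clip(inter_max - inter_min, 0.0, None).prod(axis=-1)
    vol = boxes[:, 3:6].prod(axis=-1)
    union = vol[:, None] + vol[None, :] - inter
    return inter / np.maximum(union, 1e-9)


def iou_penalty(boxes):
    """Mean IoU over all distinct object pairs; zero when nothing overlaps."""
    iou = pairwise_iou_3d(boxes)
    i, j = np.triu_indices(len(boxes), k=1)
    return float(iou[i, j].mean()) if len(i) else 0.0


# Two unit cubes offset by half a unit along x overlap in half their volume.
overlapping = np.array([[0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
                        [0.5, 0.0, 0.0, 1.0, 1.0, 1.0]])
penalty = iou_penalty(overlapping)
```

A term of this kind would be added to the layout loss so that predicted boxes that interpenetrate are penalized, while well-separated boxes contribute nothing. A training implementation would need a differentiable version in the framework used for the model.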

Abstract

Compositional 3D scene synthesis has diverse applications across industries such as robotics, film, and video games, as it closely mirrors the complexity of real-world multi-object environments. Conventional works typically employ shape-retrieval-based frameworks, which naturally suffer from limited shape diversity. Recent progress has been made in object shape generation with generative models such as diffusion models, which increases shape fidelity. However, these approaches treat 3D shape generation and layout generation separately. The synthesized scenes are usually hampered by layout collisions, which suggests that scene-level fidelity is still under-explored. In this paper, we aim to generate realistic and reasonable 3D indoor scenes from scene graphs. To enrich the priors of the given scene graph inputs, a large language model is utilized to aggregate the global-wise features with local node-wise and edge-wise features. With a unified graph encoder, graph features are extracted to guide joint layout-shape generation. Additional regularization is introduced to explicitly constrain the produced 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.

Paper Structure (22 sections, 16 equations, 10 figures, 8 tables)


Figures (10)

  • Figure 1: Overview of previous works and ours. The 3D scenes are rendered from top-down view (top right), side view (middle), and bottom-up view (bottom). Unlike previous works that show poor shape consistency or spatial arrangements, Planner3D can synthesize higher-fidelity 3D scenes which demonstrate more realistic layout configuration while preserving shape consistency and diversity.
  • Figure 2: Overview of Planner3D. Given a scene graph $\textbf{G}$, the graph prior is enriched with an LLM and CLIP to construct the graph representation $\textbf{G}^\dag$, which is then used to model the distribution $Z$. After the graph node representation is updated by replacing layout vectors with random vectors $z$ sampled from $Z$, the updated graph representation $\textbf{G}^\ddag$ is input into the graph encoder $E_G$. Guided by the extracted graph features, the layout decoder $D_L$ generates 7-parameter layouts, and the shape decoder $D_S$ works in conjunction with the VQ-VAE to synthesize object shapes for graph nodes.
  • Figure 3: Pipeline of scene graph prior enhancement. (a) Using CLIP and an LLM, the graph representation $\textbf{G}^\dag$ explicitly aggregates node-wise, edge-wise and global-wise textual representations of the given scene graph. (b) The architecture of the model $\phi$, which learns to model the distribution $Z\sim\mathcal{N} (\mu,\sigma)$ and is used to construct the graph representation $\textbf{G}^\ddag$.
  • Figure 4: $E_G$ extracts shared graph features for the dual-branch decoder. $D_L$ predicts 6-parameter 3D bounding boxes and their rotation angles. For $D_S$, the latent diffusion model with the denoiser $\epsilon_\theta$ is conditioned on the shape code $c$, and the pre-trained, frozen VQ-VAE is applied to encode GT SDFs into the target latent $s_0$ during training and to decode the predicted latent back into SDFs during inference.
  • Figure 5: Qualitative examples of bedroom, living room and dining room scenes. Compared to Graph-to-3D (Dhamo et al., 2021) and CommonScenes (Zhai et al., 2023), Planner3D shows higher fidelity with fewer interpenetrating objects.
  • ...and 5 more figures
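
The node update described in the Figure 2 and Figure 3 captions (sample $z$ from $Z\sim\mathcal{N}(\mu,\sigma)$, then replace each node's layout vector with $z$) can be sketched as below. This is an illustrative sketch only: treating $\sigma$ as a per-dimension standard deviation, storing the layout vector in the trailing columns of each node feature, and the helper names `sample_z` and `replace_layout_vectors` are assumptions that the captions do not confirm.

```python
import numpy as np


def sample_z(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    sigma is assumed here to be a per-dimension standard deviation.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


def replace_layout_vectors(node_feats, layout_dim, z):
    """Return a copy of node_feats whose trailing layout_dim columns hold z.

    Assumes (hypothetically) that each node feature is laid out as
    [semantic part | layout part].
    """
    updated = node_feats.copy()
    updated[:, -layout_dim:] = z
    return updated


rng = np.random.default_rng(0)
node_feats = np.ones((3, 8))   # 3 nodes, 8-dim features (toy sizes)
mu = np.zeros((3, 4))          # stand-ins for the outputs of the model phi
sigma = np.ones((3, 4))
z = sample_z(mu, sigma, rng)
g_ddag_nodes = replace_layout_vectors(node_feats, layout_dim=4, z=z)
```

In this toy setup the semantic columns are left untouched and only the layout columns are resampled, which is what lets different draws of $z$ yield different layouts for the same scene graph.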