GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, Bernhard Schölkopf

TL;DR

GraphDreamer tackles the challenge of grounding multi-object 3D scenes by replacing holistic text prompts with scene-graph-driven conditioning. It decomposes a scene graph into global, node-wise, and edge-wise textual descriptions and pairs them with identity-aware object fields represented as signed distance fields (SDFs), so that object geometry and appearance are disentangled through scene-level and object-level SDS losses. Key contributions include a scalable, graph-guided 3D synthesis pipeline that avoids 3D bounding boxes, and results that outperform state-of-the-art text-to-3D (TT3D) methods on CLIP-based metrics and in user studies, along with ablations that validate the importance of scene graphs and penalty terms. The approach has practical implications for controllable 3D scene generation from natural language, with avenues for extensions such as GPT-4V-guided inverse semantics and improved object-part supervision.
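The decomposition step mentioned above can be illustrated with a minimal sketch. The `Edge` class, the `decompose` function, the example scene, and the way prompts are joined into a global prompt are all hypothetical choices made for illustration; the summary does not specify how GraphDreamer phrases its prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    subject: int   # index of the source node
    relation: str  # textual relationship, e.g. "standing in front of"
    target: int    # index of the target node


def decompose(nodes, edges):
    """Split a scene graph into global, node-wise, and edge-wise prompts.

    Assumption: an edge prompt is "<subject> <relation> <target>", and the
    global prompt is the comma-joined edge prompts (or node prompts when the
    graph has no edges). This phrasing is illustrative only.
    """
    node_prompts = list(nodes)
    edge_prompts = [
        f"{nodes[e.subject]} {e.relation} {nodes[e.target]}" for e in edges
    ]
    global_prompt = ", ".join(edge_prompts if edge_prompts else node_prompts)
    return global_prompt, node_prompts, edge_prompts


# Hypothetical two-object scene.
nodes = ["a wizard", "a wooden desk"]
edges = [Edge(subject=0, relation="standing in front of", target=1)]

global_prompt, node_prompts, edge_prompts = decompose(nodes, edges)
print(global_prompt)
print(node_prompts)
print(edge_prompts)
```

In this sketch each node prompt would condition one object, while the edge and global prompts would condition renderings that contain several objects at once.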

Abstract

As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, we propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as representation and impose a constraint to avoid inter-penetration of objects. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.
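To make the inter-penetration constraint concrete, here is a minimal sketch of one way such a penalty could be written for SDF-based objects. It is not taken from the paper: the pairwise product of penetration depths, the function names `sphere_sdf` and `penetration_penalty`, and the sphere example are assumptions. The only convention relied on is that an SDF is negative inside an object.

```python
import math


def sphere_sdf(center, radius):
    """Return the signed distance function of a sphere (negative inside)."""
    def sdf(point):
        return math.dist(point, center) - radius
    return sdf


def penetration_penalty(sdfs, points):
    """Average pairwise overlap penalty over the sampled points.

    Assumption: a point is penalized when it lies inside two objects at once,
    weighted by the product of its depths inside each object.
    """
    total = 0.0
    for p in points:
        # Depth of p inside each object; zero when p is outside that object.
        depths = [max(0.0, -f(p)) for f in sdfs]
        for i in range(len(depths)):
            for j in range(i + 1, len(depths)):
                total += depths[i] * depths[j]
    return total / len(points)


samples = [(0.75, 0.0, 0.0), (0.0, 0.0, 0.0)]

# Two unit spheres whose interiors overlap around x = 0.75.
overlapping = [sphere_sdf((0.0, 0.0, 0.0), 1.0), sphere_sdf((1.5, 0.0, 0.0), 1.0)]
# The same spheres moved far enough apart that they no longer intersect.
disjoint = [sphere_sdf((0.0, 0.0, 0.0), 1.0), sphere_sdf((3.0, 0.0, 0.0), 1.0)]

overlap_penalty = penetration_penalty(overlapping, samples)
disjoint_penalty = penetration_penalty(disjoint, samples)
print(overlap_penalty, disjoint_penalty)
```

A penalty of this kind is zero whenever no sampled point lies inside more than one object, so minimizing it pushes object interiors apart without requiring 3D bounding boxes.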

Paper Structure (25 sections, 23 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: GraphDreamer takes a scene graph as input and generates a compositional 3D scene where each object is fully disentangled. To save the effort of building a scene graph from scratch, the scene graph can be generated by a language model (e.g., ChatGPT) from a user text input (left box).
  • Figure 2: The overall pipeline of GraphDreamer. Specifically, GraphDreamer first decomposes the scene graph into global, node-wise, and edge-wise text descriptions, and then optimizes the SDF-based objects in the 3D scene using their corresponding text descriptions.
  • Figure 3: Qualitative comparison with baseline approaches and the ablated configuration (w/o graph). GraphDreamer generates scenes in which all constituent objects are separable. Moreover, with accurate guidance from scene graphs, the object attributes and inter-object relationships produced by GraphDreamer match the given prompts better. We recommend zooming in for details.
  • Figure 4: The CLIP scores of individual object images $C^{(i)}$. Metric w. Self Prompt refers to scores calculated between $C^{(i)}$ and its own prompt $y^{(i)}$, and w. Other Prompts to scores between $C^{(i)}$ and the prompts of all other objects in the same scene $\{y^{(j)}, j\neq i, o_j \in \mathcal{O}\}$. Detailed experimental settings and analysis of these figures, as well as of the chart shown in Figure \ref{subfig:decompo}, can be found in Subsection \ref{subsec:analy_decompo}.
  • Figure 5: Error bands of object CLIP scores.
  • ...and 7 more figures
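
The self-prompt versus other-prompt comparison described in the Figure 4 caption can be sketched as follows. This is an illustrative assumption rather than the paper's evaluation code: the toy 2D vectors stand in for real CLIP image and text embeddings, the function names are hypothetical, and averaging over the other prompts is one possible aggregation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_and_other_scores(image_embs, prompt_embs):
    """For each object i, return (score with own prompt, mean score with others).

    image_embs[i] plays the role of the embedding of object image C^(i), and
    prompt_embs[i] that of the embedding of its prompt y^(i).
    """
    n = len(image_embs)
    results = []
    for i in range(n):
        self_score = cosine(image_embs[i], prompt_embs[i])
        others = [cosine(image_embs[i], prompt_embs[j]) for j in range(n) if j != i]
        other_score = sum(others) / len(others) if others else float("nan")
        results.append((self_score, other_score))
    return results


# Toy embeddings for a two-object scene (placeholders, not CLIP outputs).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
prompt_embs = [[1.0, 0.0], [0.0, 1.0]]

scores = self_and_other_scores(image_embs, prompt_embs)
for i, (s, o) in enumerate(scores):
    print(f"object {i}: self={s:.2f} other={o:.2f}")
```

Under this reading, well-disentangled objects should score high against their own prompt and low against the prompts of the other objects in the scene.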