Graph Canvas for Controllable 3D Scene Generation

Libin Liu, Shen Chen, Sen Jia, Jingzhe Shi, Zhongyu Jiang, Can Jin, Wu Zongkai, Jenq-Neng Hwang, Lei Li

TL;DR

GraphCanvas3D tackles the rigidity of contemporary 3D scene generation by introducing a graph-based, programmable representation that supports real-time, retraining-free edits and 4D scene evolution. It builds a hierarchical optimization pipeline (edge-, subgraph-, and graph-level) driven by Multimodal Large Language Models (MLLMs) to enforce spatial coherence and semantic consistency from concise prompts. Key contributions include a modular graph framework, real-time dynamic modification, and an evaluation showing improved usability, flexibility, and adaptability, with code available at the project website. The approach targets scalable, configurable 3D/4D scene generation in interactive environments, with applications ranging from VR/AR to intelligent robotic manipulation.

Abstract

Spatial intelligence is foundational to AI systems that interact with the physical world, particularly in 3D scene generation and spatial comprehension. Current methodologies for 3D scene generation often rely heavily on predefined datasets and struggle to adapt dynamically to changing spatial relationships. In this paper, we introduce GraphCanvas3D, a programmable, extensible, and adaptable framework for controllable 3D scene generation. Leveraging in-context learning, GraphCanvas3D adapts dynamically without retraining, supporting flexible and customizable scene creation. Our framework employs hierarchical, graph-driven scene descriptions, representing spatial elements as graph nodes and establishing coherent relationships among objects in 3D environments. Unlike conventional approaches, which are constrained in adaptability and often require predefined input masks or retraining for modifications, GraphCanvas3D allows for seamless object manipulation and scene adjustments on the fly. Additionally, GraphCanvas3D supports 4D scene generation, incorporating temporal dynamics to model changes over time. Experimental results and user studies demonstrate that GraphCanvas3D enhances usability, flexibility, and adaptability for scene generation. Our code and models are available on the project website: https://github.com/ILGLJ/Graph-Canvas.
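
To make the graph-driven description concrete, the following is a minimal sketch of how objects could be held as nodes and spatial relationships as labeled edges, with adding, deleting, and time-stepped edits expressed as plain graph operations. It is an illustration only, not the authors' implementation: the class `SceneGraph`, its method names, and the relation strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical container: objects are nodes, relations are labeled edges."""
    nodes: dict = field(default_factory=dict)   # object name -> text description
    edges: dict = field(default_factory=dict)   # (subject, object) -> relation

    def add_object(self, name, description):
        self.nodes[name] = description

    def relate(self, subject, relation, obj):
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.edges[(subject, obj)] = relation

    def remove_object(self, name):
        # Deleting a node also drops every edge that touches it.
        del self.nodes[name]
        self.edges = {k: v for k, v in self.edges.items() if name not in k}

    def at_time(self, changes):
        """Return a copy with some relations replaced (a later time step)."""
        later = SceneGraph(dict(self.nodes), dict(self.edges))
        for (subject, obj), relation in changes.items():
            later.relate(subject, relation, obj)
        return later


scene = SceneGraph()
scene.add_object("bed", "a double bed with a wooden frame")
scene.add_object("lamp", "a small bedside lamp")
scene.add_object("rug", "a round woven rug")
scene.relate("lamp", "left of", "bed")
scene.relate("rug", "in front of", "bed")

scene.remove_object("rug")                               # delete an object
frame_1 = scene.at_time({("lamp", "bed"): "right of"})   # a later time step
```

Because every edit touches only the graph, no model is retrained in this sketch; a sequence of `at_time` copies is one simple way to picture a 4D scene as a series of graphs.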

Paper Structure

This paper contains 20 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We propose GraphCanvas3D for controllable 3D scene generation, employing an optimized graph-based structure to facilitate precise and efficient layout design, such as bedrooms, thereby surpassing alternative methods in spatial control and configurability.
  • Figure 2: Overview of our method. Given a brief scene description, our method first allows the LLMs to construct a graph structure to manage the objects mentioned in the scene prompt and the relationships between them. Additionally, each object in the scene is provided with a richer description and passes through a 3D generative model to create corresponding 3D objects. We capture views of these 3D objects and let the MLLMs analyze whether the relative positions between objects are accurate. Ultimately, we achieve excellent results in terms of scene layout and rendering quality.
  • Figure 3: Edge Optimization Process. When optimizing an edge, we capture the 3D scene from four different viewpoints to obtain images from these perspectives. These four images are then sent, along with an optimized prompt, to the MLLMs, which analyze the inherent relationships between objects across the images and provide corresponding scores. These scores serve as references for optimizing this edge. After passing through a penalty function, the scores are propagated to the scene, guiding iterative optimization.
  • Figure 4: Qualitative Comparisons of Text-to-3D Scene Generation Approaches. Our method generates high-quality, interactive multi-object scenes and complex compositions that closely follow input textual descriptions. In the final column of the figure, we present the graph structure of the GraphCanvas3D method before rendering. GraphCanvas3D’s graph structure represents the 3D scene with nodes for objects and edges for their spatial relationships, ensuring consistency and coherence in scene generation.
  • Figure 5: Experiments of Dynamic Scene Modification. GraphCanvas3D is capable of object editing, adding, deleting and 4D scene generation based on textual descriptions.
  • ...and 4 more figures
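
The edge optimization loop described in the Figure 3 caption (score an edge from four viewpoints, pass the scores through a penalty function, update the scene, repeat) can be sketched roughly as below. Everything specific here is an assumption made for illustration: `stub_score` is a geometric stand-in for the MLLM call, the penalty form (one minus the mean score) is hypothetical, and the greedy position search is a swapped-in simple optimizer rather than the paper's update rule.

```python
import math

VIEWS_DEG = (0, 90, 180, 270)  # four capture viewpoints around the scene


def stub_score(view_deg, pos_a, pos_b):
    """Placeholder for the MLLM: rates 'a is left of b' in [0, 1].

    A real system would render an image from view_deg and query the model;
    this stand-in ignores the view and uses the x-coordinates directly.
    """
    gap = pos_b[0] - pos_a[0]          # positive when a has the smaller x
    return 1.0 / (1.0 + math.exp(-4.0 * (gap - 0.5)))


def penalty(scores):
    """Hypothetical penalty: zero when every view gives a perfect score."""
    return 1.0 - sum(scores) / len(scores)


def edge_penalty(pos_a, pos_b):
    return penalty([stub_score(v, pos_a, pos_b) for v in VIEWS_DEG])


def optimize_edge(pos_a, pos_b, step=0.25, tol=0.05, max_iters=50):
    """Greedy search: move object a until the edge penalty drops below tol."""
    moves = [(step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)]
    current = edge_penalty(pos_a, pos_b)
    for _ in range(max_iters):
        if current < tol:
            break
        candidates = [(pos_a[0] + dx, pos_a[1] + dz) for dx, dz in moves]
        best = min(candidates, key=lambda p: edge_penalty(p, pos_b))
        best_penalty = edge_penalty(best, pos_b)
        if best_penalty >= current:
            break                      # no move helps; stop early
        pos_a, current = best, best_penalty
    return pos_a, current


start = (1.0, 0.0)                     # lamp starts on the wrong side of the bed
bed = (0.0, 0.0)
lamp, final_penalty = optimize_edge(start, bed)
```

In this toy setup the lamp is pushed across the bed until the stub scorer agrees that it is "left of" it. Swapping `stub_score` for a real image-based model call would leave the surrounding loop unchanged, which is the point of keeping the scorer and the penalty as separate functions.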