GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis

Antonio Ruiz, Tao Wu, Andrew Melnik, Qing Cheng, Xuqin Wang, Lu Liu, Yongliang Wang, Yanfeng Zhang, Helge Ritter

TL;DR

GeoSceneGraph addresses the challenge of text-driven 3D indoor scene synthesis by leveraging the inherent graph structure and geometric symmetries of scenes without relying on predefined relationship vocabularies. It introduces a diffusion model built on $SE(3)$-equivariant graph neural networks (EGNNs), with a novel text-conditioning approach that fuses text and time-step information via a ResNet and Transformer before integrating it into the EGNN's message passing. The method uses a text-aligned shape autoencoder based on OpenCLIP embeddings reduced through a VAE to generate continuous, text-consistent shape codes, enabling flexible open-vocabulary control. Experimental results on 3D-FRONT-based datasets show competitive generation quality and controllability against strong baselines, with ablations confirming that time-conditioned, per-step text integration into EGNNs yields superior performance and robust zero-shot capabilities. This work advances efficient, graph-aware 3D scene synthesis suitable for resource-constrained deployment and embodied AI applications.

Abstract

Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.
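To make the idea of conditioning an EGNN on an extra feature vector concrete, here is a minimal NumPy sketch of one equivariant graph convolutional layer (EGCL) on a fully connected scene graph. It is not the authors' architecture: it assumes the simplest possible conditioning, where a fixed-size vector `c` (standing in for a fused text and time-step embedding) is concatenated into every edge message. The helper names `init_mlp`, `make_params` and `egcl`, the layer sizes, and the random weights are all hypothetical.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for a two-layer MLP (illustrative, untrained)."""
    return (rng.normal(0, 0.1, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0, 0.1, (d_hidden, d_out)), np.zeros(d_out))

def mlp(params, z):
    """Two-layer MLP with a SiLU activation."""
    w1, b1, w2, b2 = params
    a = z @ w1 + b1
    a = a / (1.0 + np.exp(-a))  # SiLU: a * sigmoid(a)
    return a @ w2 + b2

def make_params(rng, d_h, d_c, d_m=16):
    """Parameters for one conditioned EGCL (sizes are assumptions)."""
    return {
        "edge": init_mlp(rng, 2 * d_h + 1 + d_c, d_m, d_m),  # message MLP
        "coord": init_mlp(rng, d_m, d_m, 1),                 # scalar edge weight
        "node": init_mlp(rng, d_h + d_m, d_m, d_h),          # feature update
    }

def egcl(x, h, c, params):
    """One conditioned EGCL step.

    x: (N, 3) object positions (transform with rotations/translations)
    h: (N, D) invariant object features
    c: (C,)   conditioning vector, assumed to encode text and time step
    """
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]        # (N, N, 3) relative positions
    d2 = (diff ** 2).sum(-1, keepdims=True)     # (N, N, 1) invariant distances
    hi = np.broadcast_to(h[:, None, :], (n, n, h.shape[1]))
    hj = np.broadcast_to(h[None, :, :], (n, n, h.shape[1]))
    cc = np.broadcast_to(c, (n, n, c.shape[0]))
    # Messages see only invariant inputs, plus the conditioning vector.
    m = mlp(params["edge"], np.concatenate([hi, hj, d2, cc], axis=-1))
    mask = 1.0 - np.eye(n)[:, :, None]          # drop self-edges
    # Coordinates move along relative directions, scaled by invariant weights.
    w = mlp(params["coord"], m)
    x_new = x + (diff * w * mask).sum(axis=1) / max(n - 1, 1)
    # Features are updated from aggregated messages (residual connection).
    agg = (m * mask).sum(axis=1)
    h_new = h + mlp(params["node"], np.concatenate([h, agg], axis=-1))
    return x_new, h_new
```

Because the messages depend on positions only through pairwise squared distances, rotating or translating the input positions rotates or translates the output positions in the same way and leaves the features unchanged. The conditioning vector does not affect this property, since it enters only as an additional invariant input.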

Paper Structure

This paper contains 13 sections, 4 equations, 4 figures, 3 tables, 3 algorithms.

Figures (4)

  • Figure 1: Overall pipeline of our method. First, we encode the text prompt with a CLIP encoder, and use this text embedding to condition the diffusion process through all message passing steps in the denoising EGNN. Once object features are sampled with the denoising process, we generate the scene by retrieving objects via 1-NN search and positioning them with 3D coordinates and bounding box parameters.
  • Figure 2: EGNN Architecture. Our EGNN architecture denoises the noisy node features $[x + \epsilon_x,\, h + \epsilon_h]$ through a three-phase pipeline composed of MLP encoders, EGCL layers and MLP decoders. Our novelty is highlighted by the dotted red boxes.
  • Figure 3: Visual comparison of our method and baseline approaches for text-guided scene generation on the living room (top row) and dining room (bottom row) datasets.
  • Figure 4: Comparison visualizations of our method with DiffuScene [tang2024diffuscene] and InstructScene [lin2024instructscene] for different zero-shot tasks. Due to space constraints, we omit ATISS [paschalidou2021atiss] from the first three visualizations.
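
The retrieval step mentioned in the Figure 1 caption (objects are fetched via 1-NN search once their features have been sampled) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each catalog object has a fixed-length shape code, that distance is plain Euclidean, and that a brute-force search is acceptable. The function name `retrieve_1nn` and the toy codes are hypothetical.

```python
import numpy as np

def retrieve_1nn(generated_codes, catalog_codes):
    """For each generated shape code, return the index of the nearest catalog code.

    generated_codes: (N, D) codes sampled by the generative model
    catalog_codes:   (K, D) codes of the available 3D assets
    """
    # Pairwise squared Euclidean distances, shape (N, K).
    diff = generated_codes[:, None, :] - catalog_codes[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    # 1-NN: pick the closest catalog entry for every generated object.
    return d2.argmin(axis=1)

# Toy catalog with three assets in a 2D code space (made-up values).
catalog = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
generated = np.array([[0.1, 0.9], [0.2, -0.1]])
indices = retrieve_1nn(generated, catalog)  # one asset index per generated object
```

Each returned index selects an asset, which would then be placed using the sampled 3D coordinates and bounding box parameters. For large catalogs, an approximate nearest-neighbor index could replace the brute-force distance matrix.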