GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding

Xihan Wang, Dianyi Yang, Yu Gao, Yufeng Yue, Yi Yang, Mengyin Fu

TL;DR

This work proposes GaussianGraph, a framework that enhances 3DGS-based scene understanding by integrating adaptive semantic clustering and scene graph generation. It introduces a "Control-Follow" clustering strategy that dynamically adapts to scene scale and feature distribution, avoiding feature compression and significantly improving segmentation accuracy.

Abstract

Recent advancements in 3D Gaussian Splatting (3DGS) have significantly improved semantic scene understanding, enabling natural language queries to localize objects within a scene. However, existing methods primarily focus on embedding compressed CLIP features into 3D Gaussians; as a result, they suffer from low object segmentation accuracy and lack spatial reasoning capabilities. To address these limitations, we propose GaussianGraph, a novel framework that enhances 3DGS-based scene understanding by integrating adaptive semantic clustering and scene graph generation. We introduce a "Control-Follow" clustering strategy, which dynamically adapts to scene scale and feature distribution, avoiding feature compression and significantly improving segmentation accuracy. Additionally, we enrich the scene representation by integrating object attributes and spatial relations extracted from 2D foundation models. To address inaccuracies in spatial relationships, we propose 3D correction modules that filter implausible relations through spatial consistency verification, ensuring reliable scene graph construction. Extensive experiments on three datasets demonstrate that GaussianGraph outperforms state-of-the-art methods in both semantic segmentation and object grounding tasks, providing a robust solution for complex scene understanding and interaction.
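
The idea of filtering implausible relations through spatial consistency verification can be illustrated with a minimal sketch. This is not the paper's correction modules, whose design is not described in the abstract. It assumes each Gaussian cluster is summarized by an axis-aligned 3D bounding box with the z-axis pointing up, and it checks only two relation labels with simple geometric rules. The names `Box3D`, `is_plausible`, `filter_relations`, and the thresholds `near_thresh` and `contact_tol` are all hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned box summarizing one Gaussian cluster (assumed representation)."""
    min_xyz: tuple[float, float, float]
    max_xyz: tuple[float, float, float]

    @property
    def center(self) -> tuple[float, ...]:
        return tuple((lo + hi) / 2 for lo, hi in zip(self.min_xyz, self.max_xyz))


def overlaps_xy(a: Box3D, b: Box3D) -> bool:
    """True if the horizontal (x, y) footprints of the two boxes intersect."""
    return all(
        a.min_xyz[i] <= b.max_xyz[i] and b.min_xyz[i] <= a.max_xyz[i]
        for i in (0, 1)
    )


def is_plausible(relation: str, subj: Box3D, obj: Box3D,
                 near_thresh: float = 1.0, contact_tol: float = 0.05) -> bool:
    """Check a predicted relation against 3D geometry (hypothetical rules, z is up)."""
    if relation == "on":
        # The subject's bottom face should touch the object's top face.
        gap = subj.min_xyz[2] - obj.max_xyz[2]
        return overlaps_xy(subj, obj) and abs(gap) <= contact_tol
    if relation == "near":
        return dist(subj.center, obj.center) <= near_thresh
    return True  # relations without a rule are kept unchecked


def filter_relations(relations, boxes):
    """Keep only the (subject, relation, object) triples that pass the check."""
    return [(s, r, o) for s, r, o in relations
            if is_plausible(r, boxes[s], boxes[o])]


boxes = {
    "table": Box3D((0.0, 0.0, 0.0), (1.0, 1.0, 0.8)),
    "cup": Box3D((0.4, 0.4, 0.8), (0.5, 0.5, 0.9)),
    "lamp": Box3D((5.0, 5.0, 0.0), (5.2, 5.2, 1.5)),
}
predicted = [("cup", "on", "table"), ("lamp", "on", "table"), ("cup", "near", "lamp")]
print(filter_relations(predicted, boxes))  # [('cup', 'on', 'table')]
```

In this toy scene, the relations predicted from 2D views that contradict the 3D layout (the lamp on the table, the cup near the lamp) are dropped, while the geometrically consistent one is kept.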

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison with other CLIP-based approaches. When a textual query involves spatial relationships, CLIP features alone cannot accurately identify the target object through similarity computation. Our method associates Gaussian clusters with descriptions and relations, enabling large language models (LLMs) to reason about the target object.
  • Figure 2: Method overview. The goal of GaussianGraph is to construct a 3D scene graph of an open-world scene for downstream tasks. First, we extract 2D features, including CLIP features, segmentation, captions, and relations. Foreground objects and object pairs are input to LLaVA with prompts to generate captions and relations, which are combined with the CLIP features and segmentation by mask index. Second, given posed multi-view images, we use 3DGS to reconstruct the scene and apply the "Control-Follow" clustering strategy to generate Gaussian clusters. Third, after 3D Gaussian clustering, we build the 3D scene graph by rendering each cluster to multi-view images and matching the renderings with the CLIP features, captions, and relations. Finally, 3D correction modules with four sub-modules refine the scene graph.
  • Figure 3: Visualization of GT segmentation and our instance feature map. It illustrates that the instance feature can effectively distinguish objects.
  • Figure 4: Process of LLM-guided object grounding. The 3D scene graph contains the Gaussian_id, attributes, and relations. Given an input query, we use an LLM to infer the target Gaussian cluster id through prompt 1 and prompt 2.
  • Figure 5: Qualitative results of GaussianGraph and other 3DGS-based approaches in object grounding. GaussianGraph infers the correct object category with fewer artifacts and less noise.
  • ...and 1 more figure
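
The Figure 4 caption describes a scene graph that holds a Gaussian_id, attributes, and relations, which an LLM then reads to pick the target cluster. A minimal sketch of such a structure, and of serializing it into a text prompt, might look as follows. The class names `Node` and `SceneGraph`, the function `build_grounding_prompt`, and the prompt wording are assumptions made for illustration; the paper's actual prompts 1 and 2 are not given here, and no LLM call is made.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One Gaussian cluster with its caption and attributes (hypothetical schema)."""
    gaussian_id: int
    caption: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    nodes: dict[int, Node] = field(default_factory=dict)
    # Each edge is a (subject_id, relation, object_id) triple.
    edges: list[tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.gaussian_id] = node

    def add_relation(self, subj: int, relation: str, obj: int) -> None:
        if subj not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((subj, relation, obj))


def build_grounding_prompt(graph: SceneGraph, query: str) -> str:
    """Serialize the graph and the query into one prompt string (assumed format)."""
    lines = ["Objects:"]
    for gid in sorted(graph.nodes):
        node = graph.nodes[gid]
        attrs = ", ".join(node.attributes) or "none"
        lines.append(f"- id={gid}: {node.caption} (attributes: {attrs})")
    lines.append("Relations:")
    for subj, relation, obj in graph.edges:
        lines.append(f"- {subj} {relation} {obj}")
    lines.append(f"Query: {query}")
    lines.append("Answer with the id of the target object only.")
    return "\n".join(lines)


graph = SceneGraph()
graph.add_node(Node(0, "wooden table", ["brown"]))
graph.add_node(Node(1, "white mug", ["ceramic"]))
graph.add_node(Node(2, "white mug"))
graph.add_relation(1, "on", 0)
print(build_grounding_prompt(graph, "the mug on the table"))
```

The example scene contains two clusters with the same caption, so a similarity score against "mug" alone would not separate them; the relation triple `1 on 0` is what lets a reasoning step single out cluster 1, which is the situation the Figure 1 caption describes.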