Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

Fei Yu, Quan Deng, Shengeng Tang, Yuehua Li, Lechao Cheng

TL;DR

This work tackles open-world 3D scene understanding by combining open-vocabulary perception with retrieval-augmented reasoning. It introduces a two-part framework: a dynamic Open-World 3D Scene Graph Generator that builds semantic and spatial representations from RGB-D data without fixed labels, and a Retrieval-Augmented Reasoning pipeline that encodes scene graphs into a vector database for grounded, text/image-conditioned queries. By converting graphs into semantically enriched chunks and using them to ground LLM reasoning, the approach supports four multimodal tasks in dynamic environments: text-based QA, text-to-visual grounding, multimodal instance retrieval, and open-scene task planning. Empirical results on 3DSSG and Replica demonstrate strong zero-shot generalization, competitive graph generation, and effective scene-driven interaction, highlighting the approach's practical relevance for open-world robotics and embodied agents.

Abstract

Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on the 3DSSG and Replica benchmarks across four tasks (scene question answering, visual grounding, instance retrieval, and task planning), demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.
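
To make the second component more concrete, the following is a minimal Python sketch of how a scene graph might be flattened into text chunks, indexed, and queried. It is an illustration under stated assumptions, not the authors' implementation: the `Node` and `Edge` types, the chunk wording, and the helper names (`graph_to_chunks`, `embed`, `retrieve`) are all hypothetical, and the bag-of-words `embed` function is a toy stand-in for whatever learned text/image encoder and vector database the actual pipeline uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    label: str        # free-text, open-vocabulary label (hypothetical schema)
    position: tuple   # (x, y, z) centroid, assumed to be in metres


@dataclass(frozen=True)
class Edge:
    subject: int      # node_id of the subject
    predicate: str    # free-text relation, e.g. "standing on"
    obj: int          # node_id of the object


def graph_to_chunks(nodes, edges):
    """Turn each node plus its outgoing relations into one text chunk."""
    by_id = {n.node_id: n for n in nodes}
    chunks = {}
    for n in nodes:
        x, y, z = n.position
        parts = [f"{n.label} at ({x:.1f}, {y:.1f}, {z:.1f})"]
        for e in edges:
            if e.subject == n.node_id:
                parts.append(f"{n.label} {e.predicate} {by_id[e.obj].label}")
        chunks[n.node_id] = ". ".join(parts)
    return chunks


def embed(text):
    """Toy bag-of-words vector; a placeholder for a learned encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=2):
    """Return the ids of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index.items(), key=lambda item: cosine(q, item[1]), reverse=True)
    return [node_id for node_id, _ in ranked[:k]]


# A tiny made-up scene, purely for illustration.
nodes = [
    Node(0, "wooden table", (1.0, 0.5, 0.4)),
    Node(1, "red mug", (1.1, 0.5, 0.8)),
    Node(2, "floor lamp", (3.0, 2.0, 0.9)),
]
edges = [Edge(1, "standing on", 0), Edge(2, "next to", 0)]

chunks = graph_to_chunks(nodes, edges)
index = {node_id: embed(text) for node_id, text in chunks.items()}

top = retrieve("where is the red mug", index, k=1)
context = chunks[top[0]]  # text that would be placed in the LLM prompt
print(top, context)
```

In a pipeline of this shape, the retrieved `context` string would be prepended to the user's question so that the language model answers from the scene graph rather than from its priors; keeping one chunk per object is just one plausible granularity.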

Paper Structure

This paper contains 54 sections, 33 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of the proposed framework for Open-World 3D Scene Graph Generation and Retrieval-Augmented Navigation. The framework comprises two key components: (1) a 3D Scene Graph Generator that incrementally builds semantic and spatial representations from RGB-D sequences by detecting objects, estimating poses, selecting optimal viewpoints, and extracting inter-object relations via vision-language reasoning; and (2) a Retrieval-Augmented Reasoning module that transforms the scene graph into a vectorized knowledge base to support three categories of interaction: (i) spatial object queries, (ii) semantic relationship reasoning, and (iii) instance-level retrieval. This integrated design enables grounded, multimodal, and context-aware interaction within dynamic open-world 3D environments.
  • Figure 2: Comparison of Text-Based Scene Question Answering (Task I).
  • Figure 3: Comparison of Text-to-Visual Grounding with the MLLMs (Task II).
  • Figure 4: Example of Instance-Level Query Answering Based on 3D Scene Graph Generation (Task III).
  • Figure 5: Example of Text-Based Scene Question Answering (Tasks I & II).
  • ...and 2 more figures