Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning
Fei Yu, Quan Deng, Shengeng Tang, Yuehua Li, Lechao Cheng
TL;DR
This work tackles open-world 3D scene understanding by combining open-vocabulary perception with retrieval-augmented reasoning. It introduces a two-part framework: a dynamic Open-World 3D Scene Graph Generator that builds semantic and spatial representations from RGB-D data without fixed labels, and a Retrieval-Augmented Reasoning pipeline that encodes scene graphs into a vector database for grounded queries conditioned on text or images. By converting graphs into semantically enriched chunks and prompting an LLM to reason over the retrieved chunks, the approach supports four multimodal tasks in dynamic environments: text-based QA, text-to-visual grounding, multimodal instance retrieval, and open-scene task planning. Empirical results on 3DSSG and Replica demonstrate strong zero-shot generalization, competitive graph generation, and effective scene-driven interaction, highlighting the approach's practical relevance to open-world robotics and embodied agents.
Abstract
Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support queries conditioned on text or images. We evaluate our method on the 3DSSG and Replica benchmarks across four tasks (scene question answering, visual grounding, instance retrieval, and task planning), demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.
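To make the retrieval side of the framework concrete, the following is a minimal sketch of how a scene graph might be serialized into text chunks and queried by similarity. It is an illustration under stated assumptions, not the authors' implementation: the `Node` and `Edge` schema, the chunk wording, and the function names are all hypothetical, and a toy bag-of-words embedding with cosine similarity stands in for the learned encoder and vector database that the abstract describes.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One object instance in the scene graph (hypothetical schema)."""
    node_id: int
    label: str      # open-vocabulary label, free text
    center: tuple   # (x, y, z) position, assumed to be in metres


@dataclass(frozen=True)
class Edge:
    """A directed relation: subject --predicate--> obj (hypothetical schema)."""
    subject: int
    predicate: str
    obj: int


def graph_to_chunks(nodes, edges):
    """Serialize each node and its outgoing relations into one text chunk."""
    by_id = {n.node_id: n for n in nodes}
    chunks = {}
    for n in nodes:
        relations = [
            f"{n.label} {e.predicate} {by_id[e.obj].label}"
            for e in edges
            if e.subject == n.node_id
        ]
        x, y, z = n.center
        text = f"{n.label} at ({x:.1f}, {y:.1f}, {z:.1f})"
        if relations:
            text += ". " + ". ".join(relations)
        chunks[n.node_id] = text
    return chunks


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned text/image encoder."""
    tokens = [t.strip(".,()").lower() for t in text.split()]
    return Counter(t for t in tokens if t)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k (node_id, chunk) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(
        chunks.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Assemble a grounded prompt from retrieved chunks (format is assumed)."""
    context = "\n".join(f"[node {nid}] {text}" for nid, text in retrieved)
    return f"Scene context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    nodes = [
        Node(0, "wooden table", (1.0, 0.5, 0.4)),
        Node(1, "red mug", (1.0, 0.5, 0.8)),
        Node(2, "sofa", (3.0, 1.0, 0.4)),
    ]
    edges = [Edge(1, "on top of", 0), Edge(2, "next to", 0)]
    chunks = graph_to_chunks(nodes, edges)
    hits = retrieve("where is the red mug", chunks, k=1)
    print(build_prompt("where is the red mug", hits))
```

In this sketch each chunk carries both the object's label and its relations, so a query about one object also surfaces its spatial context. Swapping `embed` for a real encoder and `retrieve` for a vector-database lookup would leave the rest of the flow unchanged.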
