
INHerit-SG: Incremental Hierarchical Semantic Scene Graphs with RAG-Style Retrieval

YukTungSamuel Fang, Zhikang Shi, Jiabin Qiu, Zixuan Chen, Jieqi Shi, Hao Xu, Jing Huo, Yang Gao

TL;DR

This work redefines the map as a structured, RAG-ready knowledge base in which natural-language descriptions serve as explicit semantic anchors that better align with human intent. This explicit representation improves the success rate and reliability of complex retrievals, enabling the system to adapt to a broader spectrum of human interaction tasks.

Abstract

Driven by advancements in foundation models, semantic scene graphs have emerged as a prominent paradigm for high-level 3D environmental abstraction in robot navigation. However, existing approaches are fundamentally misaligned with the needs of embodied tasks. As they rely on either offline batch processing or implicit feature embeddings, the maps can hardly support interpretable human-intent reasoning in complex environments. To address these limitations, we present INHerit-SG. We redefine the map as a structured, RAG-ready knowledge base where natural-language descriptions are introduced as explicit semantic anchors to better align with human intent. An asynchronous dual-process architecture, together with a Floor-Room-Area-Object hierarchy, decouples geometric segmentation from time-consuming semantic reasoning. An event-triggered map update mechanism reorganizes the graph only when meaningful semantic events occur. This strategy enables our graph to maintain long-term consistency with relatively low computational overhead. For retrieval, we deploy multi-role Large Language Models (LLMs) to decompose queries into atomic constraints and handle logical negations, and employ a hard-to-soft filtering strategy to ensure robust reasoning. This explicit interpretability improves the success rate and reliability of complex retrievals, enabling the system to adapt to a broader spectrum of human interaction tasks. We evaluate INHerit-SG on a newly constructed dataset, HM3DSem-SQR, and in real-world environments. Experiments demonstrate that our system achieves state-of-the-art performance on complex queries, and reveal its scalability for downstream navigation tasks. Project Page: https://fangyuktung.github.io/INHeritSG.github.io/
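The hard-to-soft filtering idea from the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `ObjectNode` and `Constraint` types, the `retrieve` function, and the example scene are all hypothetical, and plain keyword matching stands in for the LLM-based query decomposition and scoring the paper describes. Hard constraints (including negated ones) prune candidates first; soft constraints then only rank the survivors by weight.

```python
from dataclasses import dataclass


@dataclass
class ObjectNode:
    # Hypothetical node layout: an object plus its Floor-Room-Area context
    # and a natural-language description used as the semantic anchor.
    name: str
    floor: str
    room: str
    area: str
    description: str


@dataclass
class Constraint:
    # Hypothetical atomic constraint, as a query decomposer might emit it.
    keyword: str
    hard: bool = False      # hard constraints filter; soft ones only score
    negated: bool = False   # True means the keyword must NOT match
    weight: float = 1.0


def matches(node: ObjectNode, c: Constraint) -> bool:
    # Keyword lookup is an assumption standing in for semantic matching.
    text = " ".join(
        [node.name, node.floor, node.room, node.area, node.description]
    ).lower()
    hit = c.keyword.lower() in text
    return hit != c.negated


def retrieve(nodes, constraints):
    hard = [c for c in constraints if c.hard]
    soft = [c for c in constraints if not c.hard]
    # Hard stage: a candidate must satisfy every hard constraint.
    survivors = [n for n in nodes if all(matches(n, c) for c in hard)]
    # Soft stage: sum the weights of the satisfied soft constraints.
    scored = [(sum(c.weight for c in soft if matches(n, c)), n) for n in survivors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Example query: "a chair that is not in the kitchen, ideally by the window".
nodes = [
    ObjectNode("chair", "floor 1", "kitchen", "dining area", "red wooden chair"),
    ObjectNode("chair", "floor 1", "office", "desk area", "black chair by the window"),
    ObjectNode("chair", "floor 1", "office", "meeting area", "red plastic chair"),
]
constraints = [
    Constraint("chair", hard=True),
    Constraint("kitchen", hard=True, negated=True),
    Constraint("window", weight=1.0),
]
ranked = retrieve(nodes, constraints)
for score, node in ranked:
    print(score, node.room, node.description)
```

Separating the two stages keeps negation unambiguous: a negated hard constraint removes candidates outright instead of merely lowering their score.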

Paper Structure (44 sections, 1 equation, 18 figures, 7 tables)

Figures (18)

  • Figure 1: INHerit-SG Overview. Our system builds a hierarchical semantic memory during online exploration and operates closed-loop retrieval. (Left) The hierarchical scene graph of a real-world office building built through incremental mapping. (Right) The robot parses a complex query into structural constraints and follows the retrieval pipeline to complete the task sequentially.
  • Figure 2: The INHerit-SG Framework. The system bridges real-time mapping with logic-aware retrieval. (Left) The pipeline employs a dual-stream architecture to balance tracking and reasoning. An Event-Triggered Map module (top-left) optimizes topological updates based on VLM decisions, while the Incremental Association block (bottom-left) fuses SAM3/DINOv3 features to instantiate nodes. (Center) The resulting data structure is a multi-level scene graph that explicitly models topological relationships. (Right) Complex queries are decomposed by Multi-role LLMs into specific constraints, including negation and weights. The system ranks candidates using a scoring function and executes a final VLM Verification step to ensure precise intent grounding.
  • Figure 3: Dual-Stream Construction Pipeline. We decouple mapping into a Geometric Stream (top) for online room segmentation and an asynchronous Semantic Stream (bottom) for fine-grained object reasoning. These threads converge via an Event-Trigger mechanism, which incrementally constructs the hierarchical scene graph from the bottom up.
  • Figure 4: Incremental Node Association Logic. The association process follows a two-stage cascade. Stage 1 filters high-confidence matches using strict geometric and visual thresholds. Stage 2 resolves ambiguities based on semantic specificity, enforcing label consistency for known categories while relying on high visual similarity for generic, open-vocabulary objects.
  • Figure 5: Event-Triggered Update. Instead of fixed-frequency updates, our system monitors topological events. (Left) A BEV map tracks historical update points (Blue Wedges) and room transitions. (Right) When an update is triggered, the system selects representative observations to summarize the room's semantics and re-assigns objects to correct early segmentation errors.
  • ...and 13 more figures
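
The two-stage association cascade described in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the `Detection` type, the `associate` function, the `GENERIC_LABELS` set, the outer distance gate `loose_dist`, and every threshold value are hypothetical, and short tuples stand in for real visual embeddings.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    # Hypothetical record for a new detection or an existing graph node.
    label: str
    center: tuple   # (x, y, z) position in metres
    feature: tuple  # visual embedding (toy 2-D vectors in this sketch)


# Assumed set of labels treated as generic, open-vocabulary categories.
GENERIC_LABELS = {"object", "thing"}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate(det, node, strict_dist=0.2, strict_sim=0.9,
              loose_dist=1.0, generic_sim=0.8):
    """Return True if `det` should be merged into `node` (thresholds assumed)."""
    dist = math.dist(det.center, node.center)
    sim = cosine(det.feature, node.feature)
    # Stage 1: accept high-confidence matches under strict thresholds.
    if dist <= strict_dist and sim >= strict_sim:
        return True
    # Assumed outer gate: distant candidates are never merged.
    if dist > loose_dist:
        return False
    # Stage 2: resolve ambiguity by semantic specificity.
    if det.label in GENERIC_LABELS or node.label in GENERIC_LABELS:
        # Generic labels carry little meaning, so rely on visual similarity.
        return sim >= generic_sim
    # Known categories must agree on the label.
    return det.label == node.label


node = Detection("chair", (0.0, 0.0, 0.0), (1.0, 0.0))
print(associate(Detection("chair", (0.1, 0.0, 0.0), (1.0, 0.0)), node))  # stage 1
print(associate(Detection("table", (0.5, 0.0, 0.0), (1.0, 0.0)), node))  # stage 2
```

Running the strict test first means the label logic is only consulted for ambiguous cases, which is the ordering the caption describes.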