KeySG: Hierarchical Keyframe-Based 3D Scene Graphs

Abdelrhman Werby, Dennis Rotondi, Fabio Scaparro, Kai O. Arras

TL;DR

The paper tackles the rigidity and scalability limitations of traditional 3D scene graphs in robotic reasoning by proposing KeySG, a hierarchical, keyframe-based 3D scene graph augmented with multi-modal context. KeySG uses adaptive keyframe sampling to efficiently capture geometry and semantics, VLM-driven open-vocabulary segmentation to create object/functional-element segments, and a hierarchical retrieval-augmented generation (RAG) pipeline to provide task-relevant context to LLM planners. The approach introduces five levels of abstraction (buildings, floors, rooms, objects, functional elements) and combines scene summaries across levels to enable efficient, grounded querying without enumerating explicit inter-object edges. Experimental results across open-vocabulary 3D segmentation, functional-element segmentation, and 3D grounding demonstrate strong performance gains and improved scalability relative to prior 3DSG methods, highlighting the potential of persistent, general-purpose world models for robotics.

Abstract

In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM's context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLMs to extract scene information, alleviating the need to explicitly model relationship edges between objects and enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across four distinct benchmarks, including 3D object segmentation and complex query retrieval, KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.

Paper Structure

This paper contains 17 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: As illustrated (top), KeySG is a hierarchical, keyframe-based 3D scene graph comprising floors, rooms, objects, and functional elements (bottom right). Each node is augmented with contextual information efficiently extracted from scene keyframes via adaptive keyframe sampling (bottom left). Leveraging a multimodal RAG pipeline, KeySG enables users to ask complex natural language queries and receive answers grounded in the 3D scene (bottom middle).
  • Figure 2: Overview of KeySG: (A) we first reconstruct the full point cloud of the 3D scene and segment it into floors and rooms; (B) for each room, we select keyframes that provide geometric coverage of the entire space while maximizing visual information; (C) we leverage VLMs to extract descriptions, object tags, and functional element tags from the selected keyframes; (D) we combine these tags with an open-vocabulary segmentation pipeline to obtain 3D segments of objects and their associated functional elements; (E) we employ LLMs to summarize the extracted keyframe descriptions into a dense, informative room summary, and subsequently aggregate room summaries into a floor-level summary, thereby generating contextual information at increasing levels of abstraction within the 3DSG. To enable efficient querying of the 3DSG, we introduce a hierarchical retrieval mechanism grounded in RAG. This mechanism exploits the graph's structure to perform a top-down search: starting from global, high-level concepts and progressively narrowing to local object nodes, ensuring that LLMs receive rich task-relevant content without exceeding their context window.
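
The top-down search described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `retrieve` function, the `top_k` parameter, and the example scene are all hypothetical, and a bag-of-words cosine similarity stands in for whatever learned embeddings a real RAG pipeline would use. It only shows the general idea of scoring summaries level by level and descending into the best matches, so that only a small part of the graph is ever passed on as context.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy hierarchical scene graph (hypothetical structure)."""
    name: str
    level: str      # e.g. "floor", "room", "object"
    summary: str    # text summary attached to the node
    children: list[Node] = field(default_factory=list)


def embed(text: str) -> Counter:
    # ASSUMPTION: word counts stand in for a learned text embedding.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(root: Node, query: str, top_k: int = 1) -> list[Node]:
    """Descend the hierarchy, keeping the top_k best-matching children per level."""
    q = embed(query)
    frontier = [root]
    while any(n.children for n in frontier):
        candidates = [c for n in frontier for c in n.children]
        candidates.sort(key=lambda c: cosine(q, embed(c.summary)), reverse=True)
        frontier = candidates[:top_k]
    return frontier


# Hypothetical two-room scene used only for illustration.
kitchen = Node("kitchen", "room", "kitchen with a fridge, a stove and a coffee machine", [
    Node("fridge", "object", "white fridge with a door handle"),
    Node("coffee machine", "object", "coffee machine with a power button"),
])
office = Node("office", "room", "office with a desk, a monitor and a chair", [
    Node("desk", "object", "wooden desk with drawers"),
    Node("monitor", "object", "monitor with a power button"),
])
floor = Node("floor 0", "floor", "ground floor with a kitchen and an office", [kitchen, office])

hits = retrieve(floor, "turn on the coffee machine in the kitchen")
print([n.name for n in hits])
```

In this toy scene the query first selects the kitchen from the room summaries and only then compares against the kitchen's objects, so the office's objects are never scored. Raising `top_k` widens the search at each level at the cost of more context.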