Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs

Yao Cheng, Zhe Han, Fengyang Jiang, Huaizhen Wang, Fengyu Zhou, Qingshan Yin, Lei Wei

TL;DR

The paper tackles the challenge of holistic indoor spatial understanding for intelligent navigation by introducing a hierarchical 3D Scene Graph (3DSG) framework powered by LLMs. It designs a multi-layer pipeline (Layer-1 metric-semantic mesh, Layer-2 object layer, and upper layers such as rooms, floors, and buildings) and uses LVLMs/LLMs to annotate node attributes, including a polling-based room classification to reduce hallucinations. Experimental results on office scenes and the Stanford 3D Scene Graph dataset show improved semantic search and robust room labeling, with room-type accuracy reaching $0.952$ under polling. The work facilitates integrated semantic and geometric reasoning for context-aware navigation and task planning in indoor environments.
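As a rough illustration of the layered structure described above, the following sketch models the upper layers (building, floor, room, object) as a simple tree and runs a keyword lookup over object descriptions. All class, field, and function names here are hypothetical and not taken from the paper; the Layer-1 metric-semantic mesh is omitted, and plain substring matching stands in for whatever semantic search the authors actually use.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Node:
    """One node of a hierarchical scene graph (hypothetical layout)."""
    node_id: str
    layer: str  # one of "building", "floor", "room", "object"
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all descendants in depth-first order."""
        yield self
        for child in self.children:
            yield from child.walk()


def find_objects(root: Node, keyword: str) -> List[str]:
    """Return ids of object nodes whose description contains the keyword.

    Substring matching is an assumption made for this sketch only.
    """
    keyword = keyword.lower()
    return [
        node.node_id
        for node in root.walk()
        if node.layer == "object"
        and keyword in node.attributes.get("description", "").lower()
    ]


# Build a tiny example graph: building -> floor -> room -> objects.
building = Node("bld_0", "building")
floor = building.add(Node("flr_0", "floor"))
room = floor.add(Node("room_0", "room", {"type": "office"}))
room.add(Node("obj_1", "object", {"description": "black office chair"}))
room.add(Node("obj_2", "object", {"description": "white coffee mug on a desk"}))

matches = find_objects(building, "coffee")
print(matches)
```

Keeping the free-text description as an attribute of each object node is what lets a query reach an object through its annotated properties rather than only its class label.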

Abstract

This paper addresses the high demand in advanced intelligent robot navigation for a more holistic understanding of spatial environments, by introducing a novel system that harnesses the capabilities of Large Language Models (LLMs) to construct hierarchical 3D Scene Graphs (3DSGs) for indoor scenarios. The proposed framework constructs 3DSGs consisting of a fundamental layer with rich metric-semantic information, an object layer featuring precise point-cloud representation of object nodes as well as visual descriptors, and higher layers of room, floor, and building nodes. Thanks to the innovative application of LLMs, not only object nodes but also nodes of higher layers, e.g., room nodes, are annotated in an intelligent and accurate manner. A polling mechanism for room classification using LLMs is proposed to enhance the accuracy and reliability of the room node annotation. Thorough numerical experiments demonstrate the system's ability to integrate semantic descriptions with geometric data, creating an accurate and comprehensive representation of the environment instrumental for context-aware navigation and task planning.
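The polling idea mentioned in the abstract can be sketched as repeated queries followed by a vote. This is a minimal sketch under stated assumptions: the abstract does not specify the voting rule, so a plain majority vote is assumed here, and the LLM call is replaced by a hypothetical stub that replays a fixed sequence of answers. The names `poll_room_label` and `make_stub_classifier` are illustrative only.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence, Tuple


def poll_room_label(
    classify: Callable[[Sequence[str]], str],
    object_labels: Sequence[str],
    rounds: int = 5,
) -> Tuple[str, float]:
    """Query the classifier several times and return the majority label.

    The second return value is the share of rounds that agreed with the
    winning label. Majority voting is an assumption of this sketch.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    votes = Counter(classify(object_labels) for _ in range(rounds))
    label, count = votes.most_common(1)[0]
    return label, count / rounds


def make_stub_classifier(answers: Sequence[str]) -> Callable[[Sequence[str]], str]:
    """Build a stand-in for an LLM call that replays canned answers in order."""
    stream = cycle(answers)

    def classify(object_labels: Sequence[str]) -> str:
        # A real system would prompt an LLM with the room's objects here.
        return next(stream)

    return classify


# The stub disagrees with itself on two of five rounds, imitating
# occasional wrong answers from a single query.
stub = make_stub_classifier(["office", "kitchen", "office", "meeting room", "office"])
label, agreement = poll_room_label(stub, ["desk", "monitor", "chair"], rounds=5)
print(label, agreement)
```

In this toy run the two outlier answers are outvoted, which is the intuition behind using polling to suppress occasional hallucinated labels.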

Paper Structure

This paper contains 13 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Illustration of the proposed hierarchical 3DSG
  • Figure 2: Examples and comparison of point-cloud and mesh representations of object nodes
  • Figure 3: 3DSG of an office scenario constructed in real-time with a wheeled robot
  • Figure 4: A scene in the Stanford 3D Scene Graph dataset (Armeni et al., 2019), where the point-cloud representation, semantic mesh, and object information of various room segments are presented
  • Figure 5: Results of semantic search in the case of different task queries when using 3DSG with and without detailed object node descriptions, respectively
  • ...and 4 more figures