Hier-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting

Boying Li, Zhixi Cai, Yuan-Fang Li, Ian Reid, Hamid Rezatofighi

TL;DR

Hier-SLAM tackles semantic-SLAM scalability by introducing a hierarchical, LLM-assisted semantic encoding within a 3D Gaussian Splatting framework. It jointly optimizes hierarchical semantic embeddings through inter-level and cross-level losses, enabling coarse-to-fine semantic understanding while keeping storage and training requirements low. Results on Replica and ScanNet show improved tracking and mapping performance, semantic rendering on par with existing methods, and scalability to more than 500 semantic classes, with rendering speeds of up to 2,000 FPS with semantic information and 3,000 FPS without it. The approach enables real-time, scalable semantic mapping in complex real-world scenes, and the code is released as open source.

Abstract

We propose Hier-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a novel hierarchical categorical representation, which enables accurate global 3D semantic mapping, scaling-up capability, and explicit semantic label prediction in the 3D world. The parameter usage in semantic SLAM systems increases significantly with the growing complexity of the environment, making scene understanding particularly challenging and costly. To address this problem, we introduce a novel hierarchical representation that encodes semantic information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs). We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization. Furthermore, we enhance the whole SLAM system, resulting in improved tracking and mapping performance. Hier-SLAM outperforms existing dense SLAM methods in both mapping and tracking accuracy, while achieving a 2x operation speed-up. Additionally, it achieves on-par semantic rendering performance compared to existing methods while significantly reducing storage and training time requirements. Rendering reaches 2,000 FPS with semantic information and 3,000 FPS without it. Most notably, it can handle complex real-world scenes with more than 500 semantic classes, highlighting its scaling-up capability. The open-source code is available at https://github.com/LeeBY68/Hier-SLAM
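
To make the compactness claim concrete, the following is a minimal sketch of how a tree-structured code can use fewer dimensions than a flat one-hot label. The toy taxonomy, the function names, and the sizing rule (one slot per possible child at each level) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: each leaf class maps to its root-to-leaf path.
TOY_PATHS = {
    "chair":  ("object", "furniture", "chair"),
    "table":  ("object", "furniture", "table"),
    "sofa":   ("object", "furniture", "sofa"),
    "lamp":   ("object", "appliance", "lamp"),
    "tv":     ("object", "appliance", "tv"),
    "wall":   ("background", "structure", "wall"),
    "floor":  ("background", "structure", "floor"),
    "window": ("background", "opening", "window"),
    "door":   ("background", "opening", "door"),
}


def level_widths(paths):
    """Per level, the largest number of children under any single parent."""
    depth = len(next(iter(paths.values())))
    widths = []
    for level in range(depth):
        children = defaultdict(set)
        for path in paths.values():
            # The path prefix identifies the parent node at this level.
            children[path[:level]].add(path[level])
        widths.append(max(len(c) for c in children.values()))
    return widths


def encode(paths, leaf):
    """Encode a leaf class as one local child index per level."""
    path = paths[leaf]
    code = []
    for level, node in enumerate(path):
        parent = path[:level]
        siblings = sorted({p[level] for p in paths.values() if p[:level] == parent})
        code.append(siblings.index(node))
    return tuple(code)


widths = level_widths(TOY_PATHS)
flat_dim = len(TOY_PATHS)   # one-hot size of a flat representation
hier_dim = sum(widths)      # concatenated per-level code size (assumed rule)
print(widths, hier_dim, flat_dim, encode(TOY_PATHS, "chair"))
```

In this toy case the saving is small (7 slots instead of 9), but under the assumed sizing rule the hierarchical size grows with the sum of per-level branching factors while the flat size grows with the number of leaves, so the gap widens as the class count increases.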

Paper Structure

This paper contains 16 sections, 9 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: (a) Left: the global 3D Gaussian map generated by Hier-SLAM with learned semantic labels. Right: the hierarchical structure of the semantic information, organized by both semantic and geometric attributes (the second blue box). The proposed hierarchical categorical representation compresses semantic data, reducing both the memory usage and the training time of semantic SLAM. (b) The semantic map rendered at different levels shows a coarse-to-fine understanding, which is useful in real-world scenarios where the viewpoint shifts from distant to close.
  • Figure 2: Left: Overview of the Hier-SLAM pipeline. The global 3D Gaussian map is initialized with the first image. The system then alternates between Tracking and Mapping steps as new frames are processed (see Section III-C). Top Right: Hierarchical representation of semantic information. The Tree Generation process uses a loop-based critic operation, comprising an LLM and a Validator, to create a tree coding from leaf to root. This tree is used to establish the hierarchical coding for each Gaussian primitive (see Section III-A). Additionally, a novel loss combining the Inter-level Loss $L_\text{Inter}$ and the Cross-level Loss $L_\text{Cross}$ is proposed for hierarchical semantic optimization (see Section III-B). Bottom Right: An example of hierarchical semantic rendering.
  • Figure 3: Visualization of our semantic rendering performance on the Replica dataset (Straub et al., 2019). The first four rows show rendered semantic segmentation in a coarse-to-fine manner. The fifth row shows the finest semantic rendering, equivalent to the flat representation with the 102 original semantic classes of the Replica dataset. The last row shows the semantic ground truth for comparison.
  • Figure 4: Visualization of the established semantic 3D map across multiple levels, demonstrating a coarse-to-fine semantic understanding of the complex scene. The bottom of the figure displays localization, mapping, and rendering performance, providing a comprehensive overview.
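
The Figure 2 caption mentions a loss that combines an inter-level term and a cross-level term but does not give their form here. The sketch below shows one plausible reading, stated as an assumption: the inter-level term sums a cross-entropy over each level's slice of the embedding, and the cross-level term maps the concatenated per-level probabilities through a linear layer to flat class logits before a final cross-entropy. All function names, the linear `weights` matrix, and the weighting factors `w_inter` and `w_cross` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target index."""
    return -math.log(softmax(logits)[target])


def inter_level_loss(embedding, widths, code):
    """Sum of per-level cross-entropies over each level's slice (assumed form)."""
    loss, start = 0.0, 0
    for width, target in zip(widths, code):
        loss += cross_entropy(embedding[start:start + width], target)
        start += width
    return loss


def cross_level_loss(embedding, widths, weights, flat_target):
    """Cross-entropy on flat classes predicted from all levels jointly (assumed form)."""
    probs, start = [], 0
    for width in widths:
        probs += softmax(embedding[start:start + width])
        start += width
    # Hypothetical linear layer: one weight row per flat class.
    flat_logits = [sum(w * p for w, p in zip(row, probs)) for row in weights]
    return cross_entropy(flat_logits, flat_target)


def hierarchical_semantic_loss(embedding, widths, code, weights, flat_target,
                               w_inter=1.0, w_cross=1.0):
    """Weighted sum of both terms; the weighting factors are placeholders."""
    return (w_inter * inter_level_loss(embedding, widths, code)
            + w_cross * cross_level_loss(embedding, widths, weights, flat_target))


# Toy setup: three levels with 2, 2 and 3 slots, and 9 flat classes.
widths = [2, 2, 3]
code = (1, 1, 0)                                  # per-level target indices
weights = [[0.0] * sum(widths) for _ in range(9)]
uninformed = [0.0] * sum(widths)                  # uniform prediction at every level
confident = [0.0, 5.0, 0.0, 5.0, 5.0, 0.0, 0.0]   # peaks at the target indices
print(inter_level_loss(uninformed, widths, code))
print(inter_level_loss(confident, widths, code))
```

With an all-zero embedding each level predicts a uniform distribution, so the inter-level term equals ln 2 + ln 2 + ln 3; raising the logits at the target indices lowers it, which is the behavior the per-level supervision is meant to encourage.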