SNI-SLAM: Semantic Neural Implicit SLAM

Siting Zhu, Guangming Wang, Hermann Blum, Jiuming Liu, Liang Song, Marc Pollefeys, Hesheng Wang

TL;DR

This work proposes SNI-SLAM, a semantic SLAM system utilizing neural implicit representation that simultaneously performs accurate semantic mapping, high-quality surface reconstruction, and robust camera tracking, and introduces hierarchical semantic representation to allow multi-level semantic comprehension for top-down structured semantic mapping of the scene.

Abstract

We propose SNI-SLAM, a semantic SLAM system utilizing neural implicit representation, that simultaneously performs accurate semantic mapping, high-quality surface reconstruction, and robust camera tracking. In this system, we introduce hierarchical semantic representation to allow multi-level semantic comprehension for top-down structured semantic mapping of the scene. In addition, to fully utilize the correlation between multiple attributes of the environment, we integrate appearance, geometry and semantic features through cross-attention for feature collaboration. This strategy enables a more multifaceted understanding of the environment, thereby allowing SNI-SLAM to remain robust even when a single attribute is defective. Then, we design an internal fusion-based decoder to obtain semantic, RGB, and Truncated Signed Distance Field (TSDF) values from multi-level features for accurate decoding. Furthermore, we propose a feature loss to update the scene representation at the feature level. Compared with low-level losses such as RGB loss and depth loss, our feature loss is capable of guiding the network optimization at a higher level. Our SNI-SLAM method demonstrates superior performance over all recent NeRF-based SLAM methods in terms of mapping and tracking accuracy on the Replica and ScanNet datasets, while also showing excellent capabilities in accurate semantic segmentation and real-time semantic mapping.
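To make the cross-attention feature collaboration mentioned in the abstract more concrete, the following is a minimal sketch of scaled dot-product cross-attention between two feature sets. It is not the authors' implementation: the function names, the toy feature values, the choice of semantic features as queries over geometry features, and the residual connection are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of value rows, one output channel at a time.
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


# Toy features for two sampled points (values are made up for illustration).
semantic = [[0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]
geometry = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Assumed pairing: semantic features query the geometry features.
attended = cross_attention(semantic, geometry, geometry)

# Assumed residual connection: keep the original semantic signal.
fused_semantic = [
    [s + a for s, a in zip(s_row, a_row)]
    for s_row, a_row in zip(semantic, attended)
]
```

Because the attention weights for each query sum to one, every attended row is a convex combination of the value rows. This gives one intuition for the robustness claim: a query from one modality draws on a blend of another modality's features rather than relying on a single, possibly defective, attribute.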

Paper Structure

This paper contains 11 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Our SNI-SLAM leverages the correlation of multi-modal features in the environment to conduct semantic SLAM based on Neural Radiance Fields (NeRF). This modeling strategy not only achieves higher accuracy than existing NeRF-based SLAM, but also enables real-time semantic mapping. We propose a feature collaboration method between appearance, geometry, and semantics, which significantly enhances the feature representation capabilities. Fused Appearance (orange box): Shadowing on the chair caused by light is eliminated. Fused Geometry (blue box): The inconsistency of the cabinet bottom edge is improved. Fused Semantic (red box): The distinction between table leg and floor is enhanced.
  • Figure 2: An overview of SNI-SLAM. Our method takes an RGB-D stream as input. RGB images are fed into a semantic feature extractor to obtain semantic features. These features are then transformed into appearance features through the appearance MLP $H_{\theta}$. Geometry features are derived from ray sampling and then processed through the geometry MLP $E_{\theta}$. Subsequently, these three types of features are fused using cross-attention-based feature fusion to generate a feature map. This feature map, the input RGB-D, and the segmentation results obtained from the segmentation network serve as supervision signals. Generated features are obtained by interpolating the scene representation; these features are then used to construct the feature loss and to obtain the generated RGB, depth, and semantics through the decoding and rendering process. The supervision signals and the generated information are used to construct the losses that update the scene representation and the MLP networks. We use hierarchical semantic representation for semantic mapping. For camera tracking, we use the loss functions to optimize the camera pose. We follow ESLAM [eslam] for the geometry and appearance scene representation.
  • Figure 3: Visualization of coarse-level and fine-level features. Coarse-level features capture the general structure and arrangement of components. Fine-level features provide more fine-grained details.
  • Figure 4: Qualitative comparison of scene reconstruction between our method and the baseline. The ground-truth images and details are rendered with the ReplicaViewer software [straub2019replica]. We visualize 3 selected scenes of the Replica dataset [straub2019replica], with details highlighted by colored boxes. Our method achieves more accurate detailed geometry and higher completion, especially in places that have limited observations.
  • Figure 5: Ablation study of semantic rendering results and ground-truth labels on office0 of Replica [straub2019replica]. We visualize rendering results under different settings: (w/o HSM) without Hierarchical Semantic Mapping; (w/o FL) without Feature Loss; (w/o FF) without Feature Fusion. The residuals show that the full SNI-SLAM achieves the best semantic accuracy.
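
The Figure 2 caption describes pairing supervision signals (the fused feature map and the input RGB-D) with generated quantities (interpolated features, rendered RGB and depth) to build the losses. The sketch below illustrates that pairing only. The L1 distance, the dictionary layout, the loss weights, and the toy numbers are assumptions, since this section does not give the actual loss definitions; the semantic loss is omitted for brevity.

```python
def l1_loss(generated, target):
    """Mean absolute error between two equal-length flat lists."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(generated)


def total_loss(generated, supervision, weights):
    """Weighted sum of per-signal L1 losses (weights are hypothetical)."""
    return sum(
        weights[name] * l1_loss(generated[name], supervision[name])
        for name in weights
    )


# Toy values for a handful of sampled pixels (made up for illustration).
generated = {
    "rgb": [0.5, 0.5],       # rendered color
    "depth": [1.0, 2.0],     # rendered depth
    "feature": [0.0, 1.0],   # features interpolated from the scene representation
}
supervision = {
    "rgb": [0.5, 0.7],       # input color
    "depth": [1.0, 2.5],     # input depth
    "feature": [0.0, 0.0],   # fused feature map from cross-attention
}

# Hypothetical weights; the feature term sits alongside the low-level terms.
weights = {"rgb": 1.0, "depth": 1.0, "feature": 0.5}

loss = total_loss(generated, supervision, weights)
```

The design point this illustrates is that the feature term compares representations directly, before decoding and rendering, while the RGB and depth terms compare only the final rendered outputs.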