NEDS-SLAM: A Neural Explicit Dense Semantic SLAM Framework using 3D Gaussian Splatting

Yiming Ji, Yang Liu, Guanghu Xie, Boyu Ma, Zongwu Xie

TL;DR

NEDS-SLAM tackles dense semantic SLAM by embedding high-dimensional semantic features into a 3D Gaussian representation and using differentiable Gaussian splatting for real-time rendering. It introduces Spatially Consistent Feature Fusion (SCFF) to fuse semantic and appearance cues with spatial consistency, a lightweight encoder-decoder to compress semantic features into Gaussian parameters, and Virtual Camera View Pruning (VCVP) to identify and attenuate outlier Gaussians using novel virtual views. The system demonstrates improved camera tracking accuracy and semantically rich reconstructions on the Replica and ScanNet datasets, outperforming several baselines, with ablations supporting the contributions of SCFF and VCVP. This work advances practical neural explicit SLAM by balancing expressive semantic embedding, memory efficiency, and real-time performance for dense semantic scene understanding.

Abstract

We propose NEDS-SLAM, a dense semantic SLAM system based on a 3D Gaussian representation that enables robust 3D semantic mapping, accurate camera tracking, and high-quality rendering in real time. In the system, we propose a Spatially Consistent Feature Fusion model to reduce the effect of erroneous estimates from a pre-trained segmentation head on semantic reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we employ a lightweight encoder-decoder to compress the high-dimensional semantic features into a compact 3D Gaussian representation, mitigating the burden of excessive memory consumption. Furthermore, we leverage the advantage of 3D Gaussian splatting, which enables efficient and differentiable novel view rendering, and propose a Virtual Camera View Pruning method to eliminate outlier Gaussians, thereby enhancing the quality of scene representations. Our NEDS-SLAM method demonstrates competitive performance against existing dense semantic SLAM methods in terms of mapping and tracking accuracy on the Replica and ScanNet datasets, while also showing excellent capabilities in 3D dense semantic mapping.
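
To make the memory argument in the abstract concrete, the following is a minimal sketch of storing a compressed semantic code per Gaussian and decoding it back when full-dimensional features are needed. It is not the authors' network: the single linear layers, the untrained random weights, the dimensions `HIGH_DIM` and `LOW_DIM`, the Gaussian count, and the helper names are all assumptions chosen for illustration.

```python
import random

HIGH_DIM = 64  # hypothetical size of the fused semantic feature
LOW_DIM = 8    # hypothetical size of the code stored per Gaussian


def make_linear(rows, cols, rng):
    """Build a rows x cols weight matrix with small random entries (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def apply_linear(weights, vec):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def feature_bytes(num_gaussians, dim, bytes_per_float=4):
    """Memory needed to store one dim-sized float vector per Gaussian."""
    return num_gaussians * dim * bytes_per_float


rng = random.Random(0)
encoder = make_linear(LOW_DIM, HIGH_DIM, rng)   # HIGH_DIM -> LOW_DIM
decoder = make_linear(HIGH_DIM, LOW_DIM, rng)   # LOW_DIM -> HIGH_DIM

feature = [rng.random() for _ in range(HIGH_DIM)]
code = apply_linear(encoder, feature)       # compact code kept with the Gaussian
restored = apply_linear(decoder, code)      # decoded only when needed

# Ratio of per-map feature memory without and with compression.
saving = feature_bytes(1_000_000, HIGH_DIM) / feature_bytes(1_000_000, LOW_DIM)
```

With these assumed sizes, each Gaussian carries 8 floats instead of 64, so the semantic part of the map shrinks by the ratio `HIGH_DIM / LOW_DIM`. In a real system the encoder and decoder weights would be learned rather than random.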

Paper Structure (17 sections, 10 equations, 6 figures, 8 tables)

This paper contains 17 sections, 10 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of the proposed NEDS-SLAM. Our method takes an RGB-D stream as input. RGB images are processed by the pretrained semantic feature extractor to get semantic features, while dense appearance features are obtained through the Spatial Feature Extractor model. The semantic and appearance features are fused to generate high-dimensional semantic features that are spatially consistent. These features are then processed by the encoder to generate low-dimensional features and embedded into the GS parameters. By employing Differentiable Rendering, real RGB images, depth images, and semantic masks predicted by a pre-trained segmentation head are utilized for Multi-Channel supervision. This approach enables the joint optimization of GS parameters. In the figure, $M$, $C$, and $D$ represent the semantic segmentation mask, color, and depth information, respectively. NEDS-SLAM achieves high-fidelity map reconstructions while simultaneously accomplishing compact and dense pixel-level semantic reconstruction.
  • Figure 2: The concept of virtual view pruning for identifying outlier Gaussians. We analyze only the Gaussians visible in the current ground-truth (GT) view (points $A$, $B$, $C$ in the figure). Point $A$ is not visible from either of the two virtual views, so it is identified as an outlier Gaussian and its opacity is reduced during subsequent optimization. While the figure depicts two virtual views in a planar scenario, our approach creates four virtual cameras by rotating the camera pose about the focal point of each GT view frame along four directions: up, down, left, and right.
  • Figure 3: Rendered virtual camera views on the ScanNet dataset. The middle images provide a zoomed-in illustration of the effectiveness of Virtual Camera View Pruning, labeled 'vcvp' in the figure. Eliminating outlier Gaussians not only improves rendering quality but also reduces the storage footprint of the map representation.
  • Figure 4: The first row shows the RGB reconstruction results. The second row shows the semantic labels predicted directly on the current frame using M2F [cheng2021mask2former]. The third row shows the semantic reconstruction results using the SGS-SLAM [li2024sgs] method based on SplaTAM [keetha2023splatam]. The fourth row shows the reconstruction results of our proposed model.
  • Figure 5: This comparison validates the effectiveness of the VCVP method, showing that NEDS-SLAM reconstructs fine details better than SplaTAM.
  • ...and 1 more figure
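
The virtual-view pruning idea in the Figure 2 caption can be sketched as follows. This is an illustrative approximation, not the authors' implementation: visibility is reduced to a simple viewing-cone test in place of the splatting rasterizer's visibility, and the focal distance, rotation angle, half field of view, decay factor, and all function names are assumptions.

```python
import math


def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def normalize(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(v, axis, angle):
    """Rotate v about a unit axis by angle (radians), Rodrigues' formula."""
    c, s = math.cos(angle), math.sin(angle)
    return add(add(scale(v, c), scale(cross(axis, v), s)),
               scale(axis, dot(axis, v) * (1.0 - c)))


def virtual_cameras(position, forward, up, focal_dist, angle):
    """Four cameras orbiting the focal point: left, right, up, down."""
    forward, up = normalize(forward), normalize(up)
    right = normalize(cross(forward, up))
    focal = add(position, scale(forward, focal_dist))  # assumed focal point
    offset = sub(position, focal)
    cams = []
    for axis, sign in ((up, 1), (up, -1), (right, 1), (right, -1)):
        pos = add(focal, rotate(offset, axis, sign * angle))
        cams.append((pos, normalize(sub(focal, pos))))  # keep looking at focal
    return cams


def visible(point, cam, half_fov):
    """Cone test: the point lies within half_fov of the camera's optical axis."""
    pos, fwd = cam
    ray = sub(point, pos)
    dist = math.sqrt(dot(ray, ray))
    return dist > 0.0 and dot(ray, fwd) / dist >= math.cos(half_fov)


def prune_step(points, opacities, cam, virtual, half_fov, decay=0.5):
    """Flag points seen in the current view but in no virtual view; decay them."""
    flags = [visible(p, cam, half_fov)
             and not any(visible(p, v, half_fov) for v in virtual)
             for p in points]
    return flags, [o * decay if f else o for o, f in zip(opacities, flags)]


current = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))  # position, forward
focal_point = (0.0, 0.0, 2.0)
cams = virtual_cameras(current[0], current[1], (0.0, 1.0, 0.0),
                       focal_dist=2.0, angle=math.radians(30))
points = [(0.0, 0.0, 0.2),   # A: floater close to the current camera
          (0.0, 0.0, 2.0),   # B: at the focal point
          (0.3, 0.0, 2.0)]   # C: near the focal point
flags, new_opacities = prune_step(points, [0.9, 0.9, 0.9], current, cams,
                                  half_fov=math.radians(20))
```

In this toy scene only point A is flagged, since it sits in front of the current camera but falls outside every virtual viewing cone, and only its opacity is decayed. Decaying opacity instead of deleting the Gaussian outright follows the caption's description that the outlier's opacity is reduced during subsequent optimization.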