OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding

Dianyi Yang, Yu Gao, Xihan Wang, Yufeng Yue, Yi Yang, Mengyin Fu

TL;DR

OpenGS-SLAM tackles open-set dense semantic SLAM by attaching explicit semantic labels to each Gaussian in a 3D Gaussian Splatting representation, enabling online 3D object-level scene understanding. It introduces Gaussian Voting Splatting for fast 2D label rendering, Confidence-based 2D Label Consensus to keep labels consistent across views, and Segmentation Counter Pruning to refine segmentation, all driven by an Ensemble Semantic Information Generator that leverages 2D foundation vision models without extra training. The authors report 10 times faster semantic rendering and 2 times lower storage costs than existing methods, along with strong tracking and reconstruction performance on the Replica and TUM datasets, which suggests practical value for open-world robotic mapping and interaction. However, the approach focuses on static scenes; extending it to dynamic environments and expanding open-set data remain future work.

Abstract

Recent advancements in 3D Gaussian Splatting have significantly improved the efficiency and quality of dense semantic SLAM. However, previous methods are generally constrained by limited-category pre-trained classifiers and implicit semantic representation, which hinder their performance in open-set scenarios and restrict 3D object-level scene understanding. To address these issues, we propose OpenGS-SLAM, an innovative framework that utilizes 3D Gaussian representation to perform dense semantic SLAM in open-set environments. Our system integrates explicit semantic labels derived from 2D foundational models into the 3D Gaussian framework, facilitating robust 3D object-level scene understanding. We introduce Gaussian Voting Splatting to enable fast 2D label map rendering and scene updating. Additionally, we propose a Confidence-based 2D Label Consensus method to ensure consistent labeling across multiple views. Furthermore, we employ a Segmentation Counter Pruning strategy to improve the accuracy of semantic scene representation. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method in scene understanding, tracking, and mapping, achieving 10 times faster semantic rendering and 2 times lower storage costs compared to existing methods. Project page: https://young-bit.github.io/opengs-github.github.io/.
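To make the idea of rendering a 2D label map from explicitly labeled Gaussians more concrete, here is a minimal sketch of one plausible reading of "voting": each Gaussian that covers a pixel votes for its own label, weighted by its alpha-blending contribution, and the pixel takes the label with the most accumulated weight. The abstract does not specify the actual rule used by Gaussian Voting Splatting, so the function name `vote_pixel_label`, the weighting scheme, and the `stop_transmittance` cutoff are all assumptions made for illustration.

```python
from collections import defaultdict


def vote_pixel_label(contribs, stop_transmittance=1e-4):
    """Pick a label for one pixel by weighted voting (illustrative only).

    contribs: list of (label, alpha) pairs for the Gaussians covering the
              pixel, sorted front to back. Both the input layout and the
              weighting are assumptions, not the paper's stated method.
    Returns the label with the largest accumulated weight, or None if no
    Gaussian covers the pixel.
    """
    votes = defaultdict(float)
    transmittance = 1.0  # fraction of light not yet absorbed
    for label, alpha in contribs:
        # Standard front-to-back alpha-blending weight of this Gaussian.
        votes[label] += alpha * transmittance
        transmittance *= 1.0 - alpha
        if transmittance < stop_transmittance:
            break  # remaining Gaussians are effectively occluded
    if not votes:
        return None
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # A faint label-3 Gaussian in front of a nearly opaque label-7 Gaussian.
    print(vote_pixel_label([(3, 0.2), (7, 0.9), (3, 0.5)]))
```

Because the result is a single discrete label per pixel instead of a blended feature vector, a scheme like this avoids decoding high-dimensional features at render time, which is consistent with the speed and storage motivation stated in the abstract.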

Paper Structure

This paper contains 15 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Compared to the feature-embedded methods (SemGaussSLAM, NEDS), our approach integrates semantic labels into the 3D Gaussian scene representation, ensuring that Gaussians belonging to the same object are consistently labeled. This enables more effective 3D object-level scene understanding and interaction. By leveraging 2D foundational vision models, our approach facilitates open-set dense semantic SLAM. The images on the left are from SemGaussSLAM.
  • Figure 2: An overview of OpenGS-SLAM. Our method takes an RGB-D stream as input. RGB images are first processed by the Semantic Information Generator and G-ICP to extract semantic information and estimate the current pose. Using this pose, we perform precise and efficient semantic rendering via Gaussian Voting Splatting. We then unify the input label map with the current map through Confidence-based 2D Label Consensus, ensuring semantic consistency. During this process, partial Gaussian data is updated, and counter Gaussians are pruned.
  • Figure 3: Effect of Segmentation Counter Pruning. Left: We select a region with less view constraint throughout the SLAM process. Right: Rendered results from new viewpoints with and without the Segmentation Counter Pruning in this region.
  • Figure 4: Qualitative comparison of novel-view open-set semantic segmentation. For TUM, novel views refer to viewpoints that are not included in the training data, and the ground truth is obtained from manual annotations.
  • Figure 5: Scene manipulation process and results. Left: We select an object from the scene for removal. Middle: Compared to GS-Grouping, our model demonstrates more efficient interactive processing. Right: Our method achieves superior manipulation results.
  • ...and 1 more figure
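The Figure 2 caption describes unifying the incoming label map with the map rendered from the current scene. As a rough illustration of what such a consensus step could look like, the sketch below matches each incoming 2D segment to the rendered label it overlaps most and reuses that label only when the overlap fraction clears a threshold; otherwise the segment gets a fresh label. The caption does not define the confidence measure, so using overlap fraction as the confidence, as well as the names `match_segments`, `min_overlap`, and `next_label`, are assumptions for this example.

```python
from collections import Counter, defaultdict


def match_segments(input_ids, rendered_labels, next_label, min_overlap=0.5):
    """Map incoming 2D segment ids to existing map labels (illustrative only).

    input_ids:       flat list of per-pixel segment ids from the 2D segmenter.
    rendered_labels: flat list of per-pixel labels rendered from the map;
                     None marks pixels the map has not observed yet.
    next_label:      first unused map label, handed out to unmatched segments.
    min_overlap:     hypothetical confidence threshold on the overlap fraction.
    Returns (mapping from segment id to map label, updated next_label).
    """
    if len(input_ids) != len(rendered_labels):
        raise ValueError("label maps must have the same number of pixels")

    overlap = defaultdict(Counter)  # segment id -> rendered label -> pixels
    size = Counter()                # segment id -> total pixels
    for seg, lab in zip(input_ids, rendered_labels):
        size[seg] += 1
        if lab is not None:
            overlap[seg][lab] += 1

    mapping = {}
    for seg in sorted(size):
        best = overlap[seg].most_common(1)
        if best and best[0][1] / size[seg] >= min_overlap:
            mapping[seg] = best[0][0]   # confident match: reuse map label
        else:
            mapping[seg] = next_label   # weak or no match: new object label
            next_label += 1
    return mapping, next_label


if __name__ == "__main__":
    ids = [0, 0, 0, 0, 1, 1, 1, 1]
    rendered = [5, 5, 5, None, 5, None, None, None]
    print(match_segments(ids, rendered, next_label=10))
```

In this toy input, segment 0 overlaps rendered label 5 on three of its four pixels and inherits it, while segment 1 overlaps on only one of four and is treated as a new object.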