OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics

Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park

TL;DR

OpenMonoGS-SLAM tackles monocular SLAM with open-set semantic understanding by integrating 3D Gaussian Splatting with visual foundation models. It operates without depth input or 3D semantic ground truth and uses a memory-augmented semantic fusion pipeline together with self-supervised losses to learn geometry and open-set semantics from RGB data. The method leverages MASt3R for tracking, SAM for 2D segmentation, and CLIP for language-grounded features, organized via a compact memory bank. Experiments on Replica and TUM RGB-D show competitive mapping quality and superior open-set segmentation, highlighting the potential of VFMs to extend SLAM into open-world perception.

Abstract

Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.

Paper Structure

This paper contains 15 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of Our Method. Top: Given the previous keyframe and the current frame, MASt3R estimates a point map. We reconstruct a 3D semantic map by augmenting each point with Gaussian attributes and a semantic feature vector. Rendering the 3D Gaussians yields an RGB color and a semantic feature map, which are supervised by the ground truth RGB image and multi-scale masks generated by SAM, respectively. Bottom: When the current frame is selected as a new keyframe, SAM generates instance masks, and masked CLIP features are extracted by applying the masks to the RGB image. These masked CLIP features are used to update the memory bank online and serve as the supervision target for the language-guided loss. The semantic map is further enhanced by memory attention to obtain a high-dimensional semantic map.
  • Figure 2: Qualitative comparisons of novel view synthesis on the Replica dataset.
  • Figure 3: Qualitative comparisons of open-set segmentation on the Replica dataset. Our method produces cleaner and more complete segmentation masks, particularly for fine-grained structures.
  • Figure 4: Qualitative comparison of ablated components on the Replica dataset. The top row shows the open-set segmentation results, while the bottom row presents the corresponding rendered features (with the ground-truth RGB image in the first column).
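
The Figure 1 caption describes two memory-related steps: masked CLIP features update a memory bank online, and the rendered semantic map is enhanced by memory attention to obtain a high-dimensional semantic map. The NumPy sketch below illustrates one plausible reading of those two steps. It is not the authors' implementation: the `MemoryBank` class, the merge-or-append rule, the `sim_threshold` and `momentum` values, the projection matrix `proj`, and the softmax `temperature` are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class MemoryBank:
    """Hypothetical compact bank of unit-norm language features (assumed design)."""

    def __init__(self, dim, sim_threshold=0.9, momentum=0.9):
        self.sim_threshold = sim_threshold  # assumed merge threshold
        self.momentum = momentum            # assumed moving-average weight
        self.keys = np.zeros((0, dim), dtype=np.float32)

    def update(self, feats):
        """Merge each new feature into its nearest entry, or append it."""
        for f in l2_normalize(np.asarray(feats, dtype=np.float32)):
            if len(self.keys) > 0:
                sims = self.keys @ f  # cosine similarity to every entry
                j = int(np.argmax(sims))
                if sims[j] >= self.sim_threshold:
                    merged = self.momentum * self.keys[j] + (1.0 - self.momentum) * f
                    self.keys[j] = l2_normalize(merged)
                    continue
            self.keys = np.vstack([self.keys, f[None, :]])


def memory_attention(queries, proj, bank, temperature=0.1):
    """Lift low-dimensional features (N, d) to the bank dimension (N, D).

    `proj` (d, D) stands in for a learned projection; it is an assumption.
    The bank must contain at least one entry.
    """
    q = l2_normalize(queries @ proj)           # (N, D) queries
    logits = q @ bank.keys.T / temperature     # (N, M) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over bank entries
    return l2_normalize(weights @ bank.keys)   # weighted mix of bank entries
```

In this sketch, each keyframe would call `bank.update(...)` with one feature per SAM mask, and `memory_attention(...)` would be applied to the per-pixel features rendered from the Gaussians. Keeping the per-Gaussian feature low-dimensional and recovering the high-dimensional feature through the bank is the design motivation the abstract gives for the memory mechanism; the specific attention form here is only a guess.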