OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics

Jisang Yoo; Gyeongjin Kang; Hyun-kyu Ko; Hyeonwoo Yu; Eunbyung Park

TL;DR

OpenMonoGS-SLAM tackles monocular SLAM with open-set semantic understanding by integrating 3D Gaussian Splatting with visual foundation models (VFMs). It operates without depth input or 3D semantic ground truth, using a memory-augmented semantic fusion pipeline together with self-supervised losses to learn geometry and open-set semantics from RGB data alone. The method relies on MASt3R for tracking, SAM for 2D segmentation, and CLIP for language-grounded features, with the semantic features organized in a compact memory bank. Experiments on Replica and TUM RGB-D show competitive mapping quality and superior open-set segmentation, highlighting the potential of VFMs to extend SLAM into open-world perception.

Abstract

Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To this end, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models generalize robustly across diverse tasks, enabling accurate monocular camera tracking and mapping as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features; it effectively constructs Gaussian semantic feature maps and leads to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary inputs such as depth maps or semantic annotations.
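
To make the idea of a compact memory for high-dimensional semantic features more concrete, the following is a minimal illustrative sketch and not the paper's actual mechanism, which the abstract does not specify. It assumes that per-segment features (for example, CLIP embeddings of SAM masks) arrive as plain vectors, that near-duplicate features are merged by a running average when their cosine similarity exceeds a threshold, and that each Gaussian would then store only a slot index instead of the full feature. The class name FeatureMemoryBank, the parameter sim_threshold, and the merge rule are all hypothetical.

```python
import math


def normalize(vec):
    """Return a unit-length copy of vec (a list of floats)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


class FeatureMemoryBank:
    """Hypothetical compact store of unit-norm semantic features."""

    def __init__(self, sim_threshold=0.9):
        self.sim_threshold = sim_threshold  # assumed merge threshold
        self.entries = []  # unit-norm prototype features, one per slot
        self.counts = []   # number of observations merged into each slot

    def insert(self, feature):
        """Merge feature into its closest slot, or open a new slot.

        Returns the slot index, which a map element could store in
        place of the full high-dimensional feature.
        """
        f = normalize(feature)
        best_idx, best_sim = -1, -1.0
        for i, entry in enumerate(self.entries):
            sim = cosine(f, entry)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx >= 0 and best_sim >= self.sim_threshold:
            # Running average of the slot prototype, then re-normalize.
            n = self.counts[best_idx]
            merged = [(n * e + x) / (n + 1)
                      for e, x in zip(self.entries[best_idx], f)]
            self.entries[best_idx] = normalize(merged)
            self.counts[best_idx] = n + 1
            return best_idx
        self.entries.append(f)
        self.counts.append(1)
        return len(self.entries) - 1

    def query(self, text_feature):
        """Return (slot index, similarity) of the slot closest to a query."""
        if not self.entries:
            raise ValueError("memory bank is empty")
        q = normalize(text_feature)
        sims = [cosine(q, entry) for entry in self.entries]
        best = max(range(len(sims)), key=sims.__getitem__)
        return best, sims[best]
```

In this sketch, an open-vocabulary query would embed a text prompt with the same encoder that produced the stored features, call query, and select the map elements whose stored slot index matches the returned one. The toy 3-dimensional vectors used to exercise it stand in for real embeddings, which have hundreds of dimensions.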
