STAMICS: Splat, Track And Map with Integrated Consistency and Semantics for Dense RGB-D SLAM

Yongxu Wang, Xu Cao, Weiyun Yi, Zhaoxin Fan

TL;DR

STAMICS tackles semantic drift in dense RGB-D SLAM by fusing semantic information with 3D Gaussian splatting. It introduces semantic-enhanced Gaussian representations, a temporal semantic consistency pipeline, and open-vocabulary expansion to label unseen objects, all optimized via differentiable rendering. The framework yields improved pose accuracy and map fidelity across multiple benchmarks, outperforming or matching state-of-the-art methods while handling dynamic and diverse environments. The approach advances dense SLAM by providing coherent semantics over time and flexible vocabulary, with practical implications for robust autonomous perception.

Abstract

Simultaneous Localization and Mapping (SLAM) is a critical task in robotics, enabling systems to autonomously navigate and understand complex environments. Current SLAM approaches predominantly rely on geometric cues for mapping and localization, but they often fail to ensure semantic consistency, particularly in dynamic or densely populated scenes. To address this limitation, we introduce STAMICS, a novel method that integrates semantic information with 3D Gaussian representations to enhance both localization and mapping accuracy. STAMICS consists of three key components: a 3D Gaussian-based scene representation for high-fidelity reconstruction, a graph-based clustering technique that enforces temporal semantic consistency, and an open-vocabulary system that allows for the classification of unseen objects. Extensive experiments show that STAMICS significantly improves camera pose estimation and map quality, outperforming state-of-the-art methods while reducing reconstruction errors. Code will be made publicly available.

Paper Structure

This paper contains 16 sections, 13 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Illustration of our motivation. From left to right: the 3D reconstruction map, the semantic reconstruction map, and the depth reconstruction map.
  • Figure 2: Overview. The RGB data is processed by SAM to extract semantic information, which is then fed into the tracking module to localize the camera. Semantic-Enhanced Gaussian Splatting integrates the semantic data into the geometric reconstruction process, ensuring consistency between semantics and geometry. The process is governed by four types of losses, one of which, the semantic consistency loss, originates from the semantic consistency module. The final output supports open-vocabulary labeling, with open-vocabulary expansion enabling dynamic learning of new objects and achieving superior reconstruction results.
  • Figure 3: Illustration of graph clustering. Nodes with high semantic consistency scores are grouped into the same category in the graph $G(k)$. Edges between inconsistent nodes are removed, resulting in a new graph $G'(k)$.
  • Figure 4: Illustration of the consistency score. The semantic label for the cabinet in the second frame is inconsistent. For the cabinet node in the first frame, the consistency score is $3/4$.
  • Figure 5: Comparison of reconstruction results with existing methods
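
The captions of Figures 3 and 4 describe a consistency score per node and the removal of edges between inconsistent nodes, but the exact definitions are not given in this summary. The sketch below is one plausible reading, not the authors' implementation. It assumes that a node's score is the fraction of its neighbours that carry the same label, and that an edge is "inconsistent" when its two endpoints disagree on the label. The function names, the frame identifiers `f1` to `f5`, and the toy labels are all hypothetical, chosen so that the first-frame cabinet node reproduces the $3/4$ score mentioned in Figure 4.

```python
from fractions import Fraction


def consistency_score(graph, labels, node):
    """Fraction of `node`'s neighbours sharing its label (assumed definition)."""
    neighbours = graph[node]
    if not neighbours:
        return Fraction(1)  # an isolated node has nothing to disagree with
    agree = sum(1 for n in neighbours if labels[n] == labels[node])
    return Fraction(agree, len(neighbours))


def prune_inconsistent_edges(graph, labels):
    """Build G' from G by keeping only edges whose endpoints share a label."""
    return {
        node: {n for n in neighbours if labels[n] == labels[node]}
        for node, neighbours in graph.items()
    }


def connected_components(graph):
    """Group nodes of an undirected graph into clusters via depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy data (hypothetical): one object observed in five frames, every pair of
# observations linked, and the second observation mislabelled.
nodes = ["f1", "f2", "f3", "f4", "f5"]
graph = {n: {m for m in nodes if m != n} for n in nodes}
labels = {"f1": "cabinet", "f2": "shelf", "f3": "cabinet",
          "f4": "cabinet", "f5": "cabinet"}

score_f1 = consistency_score(graph, labels, "f1")
pruned = prune_inconsistent_edges(graph, labels)
clusters = connected_components(pruned)
```

Under these assumptions, three of the four neighbours of `f1` agree with it, so `score_f1` equals `Fraction(3, 4)`, and pruning leaves the mislabelled observation `f2` as its own single-node cluster. A threshold on the score, rather than exact label equality, would be an equally reasonable reading of "high semantic consistency scores".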