IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping

Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, Zhizhong Su

Abstract

Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.
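
To make the idea of associating instances through cross-view consistent embeddings more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each view yields one embedding vector per instance and pairs instances across two views by mutual nearest neighbour under cosine similarity. The function name `match_instances` and the threshold `min_sim` are hypothetical and do not come from the paper.

```python
import numpy as np

def match_instances(emb_a, emb_b, min_sim=0.7):
    """Pair instances from two views by mutual nearest neighbour (cosine similarity).

    emb_a: (Na, D) array, one embedding per instance in view A.
    emb_b: (Nb, D) array, one embedding per instance in view B.
    min_sim: hypothetical acceptance threshold, chosen only for illustration.
    Returns a list of (index_a, index_b, similarity) tuples.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    # L2-normalise so that the dot product equals cosine similarity.
    a = emb_a / np.maximum(np.linalg.norm(emb_a, axis=1, keepdims=True), 1e-12)
    b = emb_b / np.maximum(np.linalg.norm(emb_b, axis=1, keepdims=True), 1e-12)
    sim = a @ b.T                      # (Na, Nb) similarity matrix
    best_b = sim.argmax(axis=1)        # best match in B for each instance in A
    best_a = sim.argmax(axis=0)        # best match in A for each instance in B
    matches = []
    for i, j in enumerate(best_b):
        # Keep a pair only if the choice is mutual and similar enough.
        if best_a[j] == i and sim[i, j] >= min_sim:
            matches.append((i, int(j), float(sim[i, j])))
    return matches
```

The mutual check discards one-sided matches, which is one simple way to reduce fragile associations; the system described in the abstract may well use a different criterion.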

Paper Structure

This paper contains 22 sections, 11 equations, 11 figures, 9 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Pipeline of IRIS-SLAM. The system uses a Unified Geo-Instance Front-end to jointly infer high-fidelity depth and cross-view consistent instance embeddings from monocular RGB streams (left). These outputs are combined to construct globally consistent dense semantic maps (right). The viewpoint-agnostic instance embeddings act as stable semantic anchors, enabling robust loop closure detection under extremely wide baselines and challenging perspective changes (bottom).
  • Figure 2: System Architecture of IRIS-SLAM. Our framework tightly couples dense geometric reconstruction with instance-level understanding by extending feed-forward 3D foundation models into a shared, multi-view coherent latent space. The pipeline realizes our core contributions through three integrated modules: (1) Unified Geo-Instance Front-end Model, which generates consistent geometric-semantic primitives from monocular streams; (2) Instance-Grounded Association, a mechanism in which instance embeddings actively drive data association; and (3) Instance-Guided Loop Closure Back-end, a robust loop closure module that leverages viewpoint-agnostic semantic instance anchors to maintain global map coherence even under extreme pose disparities.
  • Figure 3: Comparative loop closure performance across different overlap thresholds $\tau$. IRIS-SLAM outperforms the baselines by significant margins, especially in wide-baseline scenarios ($\tau=0.1$) where traditional descriptors such as NetVLAD and ORB_BoW collapse. Our method maintains a resilient F1-score$\uparrow$ and high Recall@1$\uparrow$ by leveraging instance and structural consistency.
  • Figure 4: A loop pair with a $67.4^{\circ}$ viewpoint shift. While our method correctly identifies the loop, the best-performing baselines, NetVLAD and SALAD, yield scores of only $0.4596$ and $0.2567$, respectively.
  • Figure 5: Qualitative comparison of reconstructed global point clouds on the TUM RGB-D fr1/room sequence: DepthAnything3 (DA3) Stream with SALAD loop closure vs. our proposed IRIS-SLAM with instance-guided loop closure.
  • ...and 6 more figures
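
The metrics named in the Figure 3 caption, Recall@1 and F1-score at an overlap threshold $\tau$, can be illustrated with a small sketch. The evaluation protocol assumed here is not taken from the paper: a database frame counts as a true loop for a query when their overlap is at least `tau`, Recall@1 is computed over queries that have at least one true loop, and F1 is computed over all query-database pairs at a fixed score threshold. The names `recall_at_1`, `f1_score`, and `score_thresh` are hypothetical.

```python
import numpy as np

def recall_at_1(scores, overlap, tau):
    """Fraction of queries whose top-scoring database frame is a true loop.

    scores, overlap: (Q, N) arrays of similarity scores and ground-truth overlaps.
    Only queries with at least one true loop (overlap >= tau) are counted.
    """
    scores = np.asarray(scores, dtype=float)
    positives = np.asarray(overlap, dtype=float) >= tau
    has_loop = positives.any(axis=1)
    if not has_loop.any():
        return float("nan")
    top1 = scores.argmax(axis=1)
    hit = positives[np.arange(scores.shape[0]), top1]
    return float(hit[has_loop].mean())

def f1_score(scores, overlap, tau, score_thresh):
    """Pairwise F1: a pair is predicted as a loop when its score >= score_thresh."""
    positives = np.asarray(overlap, dtype=float) >= tau
    predicted = np.asarray(scores, dtype=float) >= score_thresh
    tp = int(np.sum(predicted & positives))
    fp = int(np.sum(predicted & ~positives))
    fn = int(np.sum(~predicted & positives))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Toy data (made up for illustration): 3 queries against 3 database frames.
toy_scores = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.6, 0.5, 0.7]]
toy_overlap = [[0.5, 0.0, 0.05], [0.0, 0.05, 0.3], [0.0, 0.0, 0.0]]
print(recall_at_1(toy_scores, toy_overlap, tau=0.1))
print(f1_score(toy_scores, toy_overlap, tau=0.1, score_thresh=0.5))
```

Lowering `tau` admits pairs with less shared content as true loops, which corresponds to the wide-baseline setting the caption highlights as the hardest case for appearance-based descriptors.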