SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

TL;DR

SING3R-SLAM tackles drift and inefficiency in dense monocular SLAM by coupling locally accurate submap reconstructions with a globally consistent Gaussian map. The system uses Sub-Track3R to build and align submaps, and a Gaussian Mapper to jointly refine poses and geometry through multi-view rendering and optimization, with a bidirectional loop-closure mechanism feeding back into tracking. Key contributions include inter- and intra-submap registration, a differentiable global Gaussian map, and a robust backend that enforces global consistency while remaining memory-efficient. Experiments on 7-Scenes and ScanNet-v2 show state-of-the-art tracking, high-fidelity geometry, and superior novel-view rendering with a compact map of around 7 MB, which makes the approach practical for long indoor sequences and for downstream tasks such as novel view synthesis (NVS).

Abstract

Recent advances in dense 3D reconstruction enable the accurate capture of local geometry; however, integrating them into SLAM is challenging due to drift and redundant point maps, which limit efficiency and downstream tasks, such as novel view synthesis. To address these issues, we propose SING3R-SLAM, a globally consistent and compact Gaussian-based dense RGB SLAM framework. The key idea is to combine locally consistent 3D reconstructions with a unified global Gaussian representation that jointly refines scene geometry and camera poses, enabling efficient and versatile 3D mapping for multiple downstream applications. SING3R-SLAM first builds locally consistent submaps through our lightweight tracking and reconstruction module, and then progressively aligns and fuses them into a global Gaussian map that enforces cross-view geometric consistency. This global map, in turn, provides feedback to correct local drift and enhance the robustness of tracking. Extensive experiments demonstrate that SING3R-SLAM achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering, resulting in over 12% improvement in tracking and producing finer, more detailed geometry, all while maintaining a compact and memory-efficient global representation on real-world datasets.
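
To make the "align submaps into a global map" step concrete, here is a minimal sketch of one common way to register two overlapping point maps. The abstract does not say which registration algorithm SING3R-SLAM uses, so this substitutes the classical Umeyama closed-form similarity alignment as a stand-in. The function names `umeyama_sim3` and `apply_sim3`, and the assumption that point correspondences between submaps are already known, are illustrative and not taken from the paper. A similarity transform (scale, rotation, translation) is used rather than a rigid one because monocular reconstructions are only defined up to scale.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares fit of (s, R, t) such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (hypothetical inputs,
    e.g. points seen in the overlap between two submaps).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)

    # Center both point sets.
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def apply_sim3(points, s, R, t):
    """Map (N, 3) points from the submap frame into the target frame."""
    return s * np.asarray(points, dtype=float) @ R.T + t


if __name__ == "__main__":
    # Synthetic check: transform random points with a known Sim(3) and recover it.
    rng = np.random.default_rng(0)
    submap_pts = rng.normal(size=(100, 3))
    angle = 0.3
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle), np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    world_pts = 1.7 * submap_pts @ R_true.T + np.array([0.5, -1.0, 2.0])

    s, R, t = umeyama_sim3(submap_pts, world_pts)
    residual = np.abs(apply_sim3(submap_pts, s, R, t) - world_pts).max()
    print("scale:", s, "max residual:", residual)
```

In a full system, an estimate like this would only initialize the alignment. The abstract describes the global Gaussian map as jointly refining poses and geometry afterwards, which this sketch does not attempt.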

Paper Structure

This paper contains 24 sections, 13 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: SING3R-SLAM is a submap-based monocular SLAM system enhanced by 3D priors. Left: our key modules, where tracking produces locally accurate point maps, mapping fuses them into a compact global representation, and joint optimization further refines poses and geometry, aided by bidirectional loop closure. Right: the resulting Gaussian map supports multiple downstream tasks with global geometry consistency, extending our method beyond pose estimation.
  • Figure 2: Overview. Our system comprises three main components: Sub-Track3R (top-middle), Mapper (right), and Loop Closure (bottom-left). The top-left shows that these components interact and exchange data through the keyframe buffer to maintain consistency. Sub-Track3R performs tracking between submaps, predicting point maps and local poses that are aligned into the world coordinate system via inter-submap registration. The Mapper employs a Gaussian model as a globally consistent and compact scene representation, jointly optimizing Gaussians and poses to achieve coherent geometry and improved reconstruction quality. In Loop Closure, point map-based correction reduces large trajectory drift, and the updated Gaussian map further refines poses for globally consistent reconstruction.
  • Figure 3: Qualitative Comparison of Reconstructed Point Clouds on 7-Scenes [shotton2013scene]. We show the reconstructed point clouds with zoomed-in views for all methods. Our approach provides a compact Gaussian representation that is much cleaner and captures object geometry in detail, as illustrated in the last column. In contrast, other 3D reconstruction-based methods often produce many redundant points, which degrade visual quality. Moreover, our dense point reconstruction preserves geometric structures more accurately.
  • Figure 4: Qualitative Comparison of Reconstructed Point Clouds on office. Left: RGB images from different views. Middle: VGGT-SLAM. Right: SING3R-SLAM (Ours). Our approach accurately aligns the wall and table across views, whereas VGGT-SLAM produces misaligned and overlapping geometry.
  • Figure 5: Qualitative Comparison of Reconstructed Meshes on ScanNet-v2 [dai2017scannet]. We compare our reconstructed meshes with the Gaussian-based SLAM method HI-SLAM2 [zhang2024hi]. Our method successfully captures fine scene details, such as the bicycle in scene 0000 and the chair’s armrests in scene 0059, demonstrating superior geometric fidelity and reconstruction quality.
  • ...and 2 more figures