HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction

Wei Zhang, Qing Cheng, David Skuddis, Niclas Zeller, Daniel Cremers, Norbert Haala

TL;DR

HI-SLAM2 tackles monocular dense 3D reconstruction by fusing monocular depth priors with a learning-based dense SLAM front-end and a compact explicit map based on 3D Gaussian Splatting. It introduces Joint Depth and Scale Alignment (JDSA), a grid-based scale alignment that stabilizes monocular depth priors, a $Sim(3)$-based online pose graph BA for loop closure, and joint optimization of a 3D Gaussian map with exposure compensation, both online and offline. The system delivers fast, RGB-only reconstruction with superior geometry and rendering fidelity, outperforming Neural SLAM methods and often matching or beating RGB-D baselines across indoor and outdoor benchmarks, while supporting incremental map growth without a predefined scene boundary. These results indicate strong potential for real-time dense mapping in resource-constrained scenarios and lay groundwork for a semantic extension in future work.

Abstract

We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. Existing Neural SLAM or 3DGS-based SLAM methods often trade off rendering quality against geometry accuracy; our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to strengthen geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then to use 3D Gaussian splatting as the core map representation to model the scene efficiently. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy that improves scale consistency in the prior depths and preserves finer depth details. Through extensive experiments on Replica, ScanNet, and ScanNet++, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality. The project page and source code will be made available at https://hi-slam2.github.io/.
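The map update on loop closure can be illustrated with a minimal sketch. It assumes each Gaussian is anchored to one keyframe, that keyframe poses are world-from-keyframe $Sim(3)$ transforms stored as 4x4 matrices, and that a pose correction is applied rigidly (plus scale) to the anchored Gaussians. The function names `sim3_matrix` and `deform_anchored_gaussians` are hypothetical and not taken from the paper or its code.

```python
import numpy as np

def sim3_matrix(scale, rotation, translation):
    """Build a 4x4 Sim(3) matrix [[s*R, t], [0, 1]] (assumed pose convention)."""
    T = np.eye(4)
    T[:3, :3] = scale * np.asarray(rotation, dtype=float)
    T[:3, 3] = translation
    return T

def deform_anchored_gaussians(means, rotations, scales, T_old, T_new):
    """Move Gaussians anchored to a keyframe whose pose changed from T_old to T_new.

    means: (N, 3) centers, rotations: (N, 3, 3) orientations, scales: (N, 3) axis lengths.
    """
    # Relative correction in the world frame: undo the old pose, apply the new one.
    delta = T_new @ np.linalg.inv(T_old)
    A = delta[:3, :3]                  # equals ds * dR
    ds = np.cbrt(np.linalg.det(A))     # det(s * R) = s^3 for a rotation R
    dR = A / ds
    new_means = means @ A.T + delta[:3, 3]
    new_rotations = dR @ rotations     # (3, 3) broadcasts over (N, 3, 3)
    new_scales = ds * scales           # scale drift correction resizes the Gaussians
    return new_means, new_rotations, new_scales

# Usage: one Gaussian, keyframe pose corrected by a 90-degree yaw, scale 2, and a shift.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_old = sim3_matrix(1.0, np.eye(3), [0.0, 0.0, 0.0])
T_new = sim3_matrix(2.0, Rz90, [1.0, 0.0, 0.0])
means = np.array([[1.0, 0.0, 0.0]])
rotations = np.eye(3)[None]
scales = np.array([[0.1, 0.2, 0.3]])
new_means, new_rotations, new_scales = deform_anchored_gaussians(
    means, rotations, scales, T_old, T_new
)
```

Because the correction is a closed-form transform per keyframe, the map can follow a pose graph update without re-optimizing the Gaussians from scratch; how the actual system assigns anchors and refines the map afterwards is not specified here.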

Paper Structure

This paper contains 23 sections, 17 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Comparison of SLAM paradigms: while map-centric SLAM employs a unified map representation for both tracking and joint optimization, the hybrid design approach utilizes learning-based features and bundle adjustment for tracking, producing depth as an intermediate scene representation. This is then used to initialize the 3D map and supervise the joint optimization of camera poses and scene geometry.
  • Figure 2: System Overview: Our framework consists of four key stages: online camera tracking, online loop closing, continuous mapping, and offline refinement. The camera tracking is performed using a recurrent-network-based approach to estimate camera poses $\mathbf{T}$ and generate depth maps $\mathbf{D}$ from RGB input. Depth priors are incorporated into the tracking process through our proposed Joint Depth and Scale Alignment (JDSA) strategy, which improves depth estimation accuracy. For 3D scene representation, we use 3DGS to model scene geometry, enabling efficient online map updates. These updates are integrated with $Sim(3)$-based pose graph Bundle Adjustment (BA) for online loop closing, allowing for scale drift correction via scale updates $\Delta \mathbf{s}$, and achieving both fast updates and high-quality rendering. In the offline refinement stage, camera poses and scene geometry undergo full BA, followed by joint optimization of Gaussian primitives and camera poses to further enhance global consistency.
  • Figure 3: Example of scale alignment of monocular depth.
  • Figure 4: View coverage analysis in two scenarios: (a) Optimal case where consecutive keyframes maintain sufficient overlap, ensuring proper multi-view coverage. (b) Suboptimal case where newly observed regions in keyframe $K_t$ lack adequate observations. Our system addresses this by inserting additional post-keyframes (shown in blue) to enhance view coverage.
  • Figure 5: Qualitative comparison of geometry reconstruction on the Replica dataset.
  • ...and 8 more figures
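
The grid-based scale alignment named in the abstract and in the Figure 2 and Figure 3 captions can be pictured with a minimal sketch. It assumes a coarse 2D grid of scale factors that is bilinearly interpolated to image resolution and multiplied into the monocular prior depth. The grid size, the interpolation scheme, and the names `interpolate_scale_grid` and `align_prior_depth` are illustrative assumptions; in the paper the scales are estimated jointly with depth and poses, which this sketch does not reproduce.

```python
import numpy as np

def interpolate_scale_grid(scale_grid, height, width):
    """Bilinearly upsample a coarse (gh, gw) scale grid to (height, width); needs gh, gw >= 2."""
    gh, gw = scale_grid.shape
    # Continuous grid coordinates per pixel; grid corners coincide with image corners.
    ys = np.linspace(0.0, gh - 1.0, height)
    xs = np.linspace(0.0, gw - 1.0, width)
    y0 = np.clip(np.floor(ys).astype(int), 0, gh - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, gw - 2)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = (1 - wx) * scale_grid[y0][:, x0] + wx * scale_grid[y0][:, x0 + 1]
    bottom = (1 - wx) * scale_grid[y0 + 1][:, x0] + wx * scale_grid[y0 + 1][:, x0 + 1]
    return (1 - wy) * top + wy * bottom

def align_prior_depth(prior_depth, scale_grid):
    """Apply a spatially varying scale to a monocular depth prior of shape (H, W)."""
    scale = interpolate_scale_grid(scale_grid, *prior_depth.shape)
    return scale * prior_depth

# Usage: the prior needs scale 1 at the top of the image and scale 3 at the bottom.
prior = np.ones((3, 3))
grid = np.array([[1.0, 1.0], [3.0, 3.0]])
aligned = align_prior_depth(prior, grid)
```

A single global scale could not correct this prior, since the required factor changes across the image; a grid of scales can, while remaining far lower-dimensional than a per-pixel correction.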