SplatMAP: Online Dense Monocular SLAM with 3D Gaussian Splatting

Yue Hu, Rong Liu, Meida Chen, Peter Beerel, Andrew Feng

TL;DR

SplatMap targets high-fidelity dense monocular 3D reconstruction by unifying dense SLAM with 3D Gaussian Splatting (3DGS). It introduces SLAM-Informed Adaptive Densification (SIAD), which prunes and densifies the Gaussian scene using real-time SLAM depth and pose updates, and Geometry-Guided Optimization, which jointly optimizes appearance and geometry with an RGB loss incorporating MS-SSIM and an edge-aware normal loss. The mapping pipeline converts SLAM outputs into a dense Gaussian representation with covariance decomposition, renders it with a differentiable tile rasterizer, and updates it through keyframe-driven optimization. On Replica and TUM-RGBD, the method reports state-of-the-art monocular results in PSNR, SSIM, and LPIPS, often surpassing RGB-D baselines while remaining real-time. The authors position the framework as a bridge between photometric and geometric dense representations, with potential applications in real-time robotics, AR/VR, and autonomous navigation.
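The "covariance decomposition" mentioned above is not spelled out in this summary. A minimal sketch, assuming the standard 3DGS factorization Sigma = R S S^T R^T with a per-axis scale vector and a unit quaternion (the function names and the (w, x, y, z) quaternion order are illustrative choices, not taken from the paper):

```python
import math

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)  # normalize defensively
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance_from_scale_rotation(scale, quat):
    """Build Sigma = R S S^T R^T, where S = diag(scale) (assumed parameterization)."""
    R = quat_to_rotmat(quat)
    # M = R S: scale each column of R by the matching axis length.
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M M^T, symmetric and positive semi-definite by construction.
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Hypothetical usage: a Gaussian stretched along its local y-axis, rotated 90 degrees about z.
sigma = covariance_from_scale_rotation(
    (1.0, 2.0, 3.0),
    (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)),
)
```

Optimizing scale and rotation separately, rather than the nine covariance entries directly, keeps every intermediate covariance valid during gradient descent, which is the usual motivation for this factorization.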

Abstract

Achieving high-fidelity 3D reconstruction from monocular video remains challenging due to the inherent limitations of traditional methods like Structure-from-Motion (SfM) and monocular SLAM in accurately capturing scene details. While differentiable rendering techniques such as Neural Radiance Fields (NeRF) address some of these challenges, their high computational costs make them unsuitable for real-time applications. Additionally, existing 3D Gaussian Splatting (3DGS) methods often focus on photometric consistency, neglecting geometric accuracy and failing to exploit SLAM's dynamic depth and pose updates for scene refinement. We propose a framework integrating dense SLAM with 3DGS for real-time, high-fidelity dense reconstruction. Our approach introduces SLAM-Informed Adaptive Densification, which dynamically updates and densifies the Gaussian model by leveraging dense point clouds from SLAM. Additionally, we incorporate Geometry-Guided Optimization, which combines edge-aware geometric constraints and photometric consistency to jointly optimize the appearance and geometry of the 3DGS scene representation, enabling detailed and accurate SLAM mapping reconstruction. Experiments on the Replica and TUM-RGBD datasets demonstrate the effectiveness of our approach, achieving state-of-the-art results among monocular systems. Specifically, our method achieves a PSNR of 36.864, SSIM of 0.985, and LPIPS of 0.040 on Replica, representing improvements of 10.7%, 6.4%, and 49.4%, respectively, over the previous SOTA. On TUM-RGBD, our method outperforms the closest baseline by 10.2%, 6.6%, and 34.7% in the same metrics. These results highlight the potential of our framework in bridging the gap between photometric and geometric dense 3D scene representations, paving the way for practical and efficient monocular dense reconstruction.

Paper Structure (28 sections, 9 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Visual comparison of reconstructed scenes. Our method (SplatMap) achieves superior rendering fidelity and geometric accuracy, closely matching the ground truth and outperforming GlORIE-SLAM [zhang2024glorie] and Photo-SLAM [huang2024photo] in fine structural details.
  • Figure 2: Overview of our proposed system. The framework integrates SLAM-based pose estimation and depth refinement with Gaussian-based 3D scene representations. A convolutional GRU module iteratively refines depth $d$ and pose $T$ using confidence weights $\omega_{ij}$ and revision flow $r_{ij}$. The refined depth and pose are projected to a Gaussian representation $G$, which is optimized to minimize geometric loss ($L_{\text{geo}}$) and photometric loss ($L_{\text{rgb}}$) through a differentiable tile rasterizer. This process generates accurate surface normals and high-quality RGB reconstructions, improving 3DGS SLAM representation fidelity.
  • Figure 3: When the Gaussian scale is set to 0.01, the inaccuracy of Gaussian position estimation becomes evident, particularly during the initial phase of SLAM when data is sparse. The accumulation of non-updated Gaussian point clouds results in ghosting artifacts on walls and furniture, which degrades both the geometric quality of the reconstruction and the final rendering performance.
  • Figure 4: Mapping comparison on the Replica dataset
  • Figure 5: Mapping comparison on the TUM dataset
  • ...and 1 more figure
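
The Figure 2 caption describes projecting refined depth and pose into the Gaussian representation. As a rough sketch of that step, assuming a pinhole camera, a metric depth map, and a 4x4 camera-to-world pose (the function name, argument layout, and the rule of skipping non-positive depths are assumptions for illustration, not the paper's implementation), candidate Gaussian centers could be obtained by back-projection:

```python
def backproject_depth(depth, fx, fy, cx, cy, T_wc):
    """Lift a depth map to world-space points.

    depth: list of rows (H x W) of depth values; non-positive entries are skipped.
    fx, fy, cx, cy: pinhole intrinsics (assumed camera model).
    T_wc: 4x4 camera-to-world transform as nested lists.
    Returns a list of (x, y, z) tuples usable as initial Gaussian centers.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # treat as invalid / missing depth
            # Pixel (u, v) with depth d -> point in the camera frame.
            xc = (u - cx) / fx * d
            yc = (v - cy) / fy * d
            zc = d
            # Rigid transform into the world frame.
            pw = tuple(
                T_wc[i][0] * xc + T_wc[i][1] * yc + T_wc[i][2] * zc + T_wc[i][3]
                for i in range(3)
            )
            points.append(pw)
    return points

# Hypothetical 2x2 depth map with one invalid pixel; camera shifted by +1 along x.
pose = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
centers = backproject_depth([[2.0, 0.0], [2.0, 2.0]], 1.0, 1.0, 0.0, 0.0, pose)
```

Because SLAM keeps revising depth and pose, points lifted this way would need to be re-projected whenever a keyframe is updated; how SplatMap schedules those updates is not covered in this summary.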