
Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

Erik Sandström, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Luc Van Gool, Martin R. Oswald, Federico Tombari

TL;DR

The paper tackles the challenge of robust, RGB-only dense SLAM by introducing Splat-SLAM, a pipeline that jointly optimizes camera tracking and a deformable 3D Gaussian map while using a proxy depth that fuses multi-view and monocular estimates. It combines frame-to-frame dense tracking with online loop closure and global bundle adjustment, enabling online deformation of the Gaussian map to maintain global consistency. Key contributions include the first RGB-only system to integrate loop closure, proxy depth, and online 3D Gaussian deformation, achieving superior rendering and reconstruction on Replica, TUM-RGBD, and ScanNet with competitive tracking and efficient memory usage. The results demonstrate high-quality dense rendering and geometry from RGB input alone, with practical performance for real-world scenarios.

Abstract

3D Gaussian Splatting has emerged as a powerful representation of geometry and appearance for RGB-only dense Simultaneous Localization and Mapping (SLAM), as it provides a compact dense map representation while enabling efficient and high-quality map rendering. However, existing methods show significantly worse reconstruction quality than competing methods using other 3D representations, e.g. neural point clouds, since they either do not employ global map and pose optimization or make use of monocular depth. In response, we propose the first RGB-only SLAM system with a dense 3D Gaussian map representation that utilizes all benefits of globally optimized tracking by adapting dynamically to keyframe pose and depth updates through active deformation of the 3D Gaussian map. Moreover, we find that refining the depth updates in inaccurate areas with a monocular depth estimator further improves the accuracy of the 3D reconstruction. Our experiments on the Replica, TUM-RGBD, and ScanNet datasets indicate the effectiveness of globally optimized 3D Gaussians, as the approach achieves superior or on-par performance with existing RGB-only SLAM methods in tracking, mapping, and rendering accuracy while yielding small map sizes and fast runtimes. The source code is available at https://github.com/eriksandstroem/Splat-SLAM.

Paper Structure

This paper contains 17 sections, 14 equations, 5 figures, 18 tables.

Figures (5)

  • Figure 1: Splat-SLAM. Our system yields accurate scene reconstruction (rendering depth L1) and rendering (PSNR), with tracking accuracy (ATE RMSE) on par with GlORIE-SLAM and a map size on par with MonoGS. The results are averaged over all keyframes. The scene is fr1 room from TUM-RGBD (Sturm et al., 2012).
  • Figure 2: Splat-SLAM Architecture. Given an RGB input stream, we track and map each keyframe, initially estimating poses through local bundle adjustment (BA) using a DSPO (Disparity, Scale and Pose Optimization) layer. This layer integrates pose and depth estimation, enhancing depth with monocular depth. It further refines poses globally via online loop closure and global BA. The proxy depth map merges keyframe depths $\tilde{D}$ from the tracking with monocular depth $D^{mono}$ to fill gaps. Mapping employs a deformable 3D Gaussian map, optimizing its parameters through a re-rendering loss. Notably, the 3D map adjusts for global pose and depth updates before each mapping phase.
  • Figure 3: Rendering Results on ScanNet (Dai et al., 2017) and TUM-RGBD (Sturm et al., 2012). Our method yields better rendering quality than GlORIE-SLAM and MonoGS. Top row: the orange box shows artifacts from GlORIE-SLAM, partly due to the depth-guided volume rendering. The yellow box shows an area with redundant floating points. The red box shows a rendering distortion, likely from the large trajectory error. The green boxes show that our method fuses information from multiple views to avoid the motion blur present in the input. Fourth row: the rendering is from the pose of the red box in the third row.
  • Figure 4: Reconstruction Results on Replica (Straub et al., 2019) on Normal Shaded Meshes. Our method achieves higher geometric accuracy than existing works. In particular, GlORIE-SLAM suffers from artifacts caused by floating points (e.g. second row), whereas our method reconstructs even the individual legs of the table. MonoGS suffers significantly from a lack of proxy depth, despite multiview optimization.
  • Figure S6: Reconstruction Results on Replica (Straub et al., 2019). Judging by the normal shaded meshes, our method improves on the geometric accuracy of existing works. In particular, GlORIE-SLAM suffers from artifacts caused by floating points. MonoGS suffers badly from a lack of proxy depth, despite multiview optimization.
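
The Figure 2 caption says the proxy depth map merges the keyframe depths $\tilde{D}$ from tracking with the monocular depth $D^{mono}$ to fill gaps. The sketch below illustrates one plausible way to do such a merge; it is not taken from the authors' code. The function name `fuse_proxy_depth`, the boolean `valid` mask, and the least-squares scale-and-shift alignment of the monocular depth are all assumptions made for illustration.

```python
import numpy as np


def fuse_proxy_depth(d_track, valid, d_mono):
    """Fill gaps in a tracked depth map with aligned monocular depth.

    d_track: (H, W) depth from tracking; only trusted where `valid` is True.
    valid:   (H, W) boolean mask of trusted tracked pixels (hypothetical input).
    d_mono:  (H, W) monocular depth, assumed correct up to scale and shift.
    """
    n_valid = int(valid.sum())
    if n_valid < 2:
        # Not enough anchor pixels to fit scale and shift; use the
        # monocular depth unaligned wherever the tracked depth is missing.
        return np.where(valid, d_track, d_mono)

    # Assumed alignment: solve scale * d_mono + shift ~= d_track
    # in the least-squares sense over the valid pixels.
    design = np.stack([d_mono[valid], np.ones(n_valid)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, d_track[valid], rcond=None)

    # Keep tracked depth where it is trusted, aligned monocular depth elsewhere.
    return np.where(valid, d_track, scale * d_mono + shift)


# Toy example: one pixel of the tracked depth is missing.
d_mono = np.array([[1.0, 2.0], [3.0, 4.0]])
d_true = 2.0 * d_mono + 0.5
valid = np.array([[True, True], [True, False]])
d_track = np.where(valid, d_true, 0.0)
proxy = fuse_proxy_depth(d_track, valid, d_mono)
```

In this toy case the missing pixel is recovered because the monocular depth differs from the true depth only by a global scale and shift. With real monocular predictions the alignment would be approximate, and how the valid mask is chosen would matter considerably.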