GlORIE-SLAM: Globally Optimized RGB-only Implicit Encoding Point Cloud SLAM

Ganlin Zhang, Erik Sandström, Youmin Zhang, Manthan Patel, Luc Van Gool, Martin R. Oswald

TL;DR

GlORIE-SLAM addresses the challenge of RGB-only dense SLAM by introducing a deformable neural point cloud map and a DSPO layer that fuses monocular depth priors into two-stage optimization for pose, disparity, and depth scale. The system performs online loop closure and global bundle adjustment to maintain global map consistency without retraining neural grids, and it renders via depth-guided volume rendering using a proxy depth map. Empirically, GlORIE-SLAM achieves state-of-the-art or competitive rendering, reconstruction, and tracking on Replica, TUM-RGBD, and ScanNet datasets, while maintaining reasonable memory and runtime. This approach offers a scalable, RGB-only solution with robust global consistency and high-fidelity rendering suitable for real-world indoor environments.

Abstract

Recent RGB-only dense Simultaneous Localization and Mapping (SLAM) methods have predominantly used grid-based neural implicit encodings and/or struggle to achieve global map and pose consistency efficiently. To this end, we propose an efficient RGB-only dense SLAM system using a flexible neural point cloud scene representation that adapts to keyframe pose and depth updates without costly backpropagation. Another critical challenge of RGB-only SLAM is the lack of geometric priors. To alleviate this issue, with the aid of a monocular depth estimator, we introduce a novel DSPO layer for bundle adjustment which optimizes the pose and depth of keyframes along with the scale of the monocular depth. Finally, our system benefits from loop closure and online global bundle adjustment and performs better than or competitively with existing dense neural RGB SLAM methods in tracking, mapping and rendering accuracy on the Replica, TUM-RGBD and ScanNet datasets. The source code is available at https://github.com/zhangganlin/GlOIRE-SLAM
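To make concrete the idea of a point cloud that "adapts to keyframe poses and depth updates" without backpropagation, here is a minimal NumPy sketch. It assumes each point is anchored to a pixel of one keyframe, so that a pose or depth update can be applied as a rigid re-projection plus a rescaling along the viewing ray. The function name `deform_points`, the camera-to-world pose convention, and the per-point depth arrays are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def deform_points(points_w, T_old, T_new, depth_old, depth_new):
    """Re-anchor the world-space points of one keyframe after an update.

    points_w  : (N, 3) world coordinates of points anchored to the keyframe
    T_old/new : (4, 4) camera-to-world poses before/after the update (assumed convention)
    depth_old/new : (N,) depth of each point's anchor pixel before/after the update
    """
    # Bring the points into the old camera frame: R_old^T (p - t_old), row-wise.
    R_old, t_old = T_old[:3, :3], T_old[:3, 3]
    points_c = (points_w - t_old) @ R_old

    # Slide each point along its viewing ray so its depth matches the new estimate.
    points_c = points_c * (depth_new / depth_old)[:, None]

    # Map back to the world frame with the updated pose.
    R_new, t_new = T_new[:3, :3], T_new[:3, 3]
    return points_c @ R_new.T + t_new

# Hypothetical usage: the pose shifts by 1 m along x and the depth doubles.
T_old = np.eye(4)
T_new = np.eye(4)
T_new[:3, 3] = [1.0, 0.0, 0.0]
pts = np.array([[0.0, 0.0, 2.0]])
moved = deform_points(pts, T_old, T_new, np.array([2.0]), np.array([4.0]))
```

Because the update is a closed-form transform of the point positions, the neural features attached to the points can stay untouched. A grid-based map, by contrast, would have to be re-optimized for the new geometry.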

Paper Structure (17 sections, 48 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: GlORIE-SLAM Results on ScanNet scene 0000. GlORIE-SLAM uses a deformable point cloud as the scene representation and achieves lower trajectory error and higher rendering accuracy than competitive approaches, e.g. GO-SLAM. The geometric accuracy is evaluated qualitatively. The light blue trajectory is the ground truth and the blue trajectory is the estimate. The PSNR is evaluated over all keyframes.
  • Figure 2: GlORIE-SLAM Architecture. Given an input RGB stream, we first track and then map every keyframe. The pose is initially estimated with local bundle adjustment (BA), using frame-to-frame tracking based on recurrent optical flow estimation. This is done with our novel DSPO (Disparity, Scale and Pose Optimization) layer, which combines pose and depth estimation with scale and depth refinement by leveraging a monocular depth prior. The DSPO layer also refines the poses globally via online loop closure and global BA. For mapping at the estimated pose, a proxy depth map is computed by combining the noisy keyframe depths from the tracking module with the monocular depth prior, which accounts for missing observations. Mapping uses the proxy depth together with the input RGB keyframe and is performed on a deformable neural point cloud with depth-guided volumetric rendering. A re-rendering loss against the input RGB and proxy depth optimizes the neural features and the color decoder weights. Importantly, the neural point cloud deforms to account for global updates of the poses and proxy depth before each mapping phase.
  • Figure 3: Rendering Results on Replica [straub2019replica]. The red boxes show blurry artifacts from GO-SLAM, caused by insufficient optimization when camera poses and depth are updated. Our method avoids these artifacts by deforming the points accordingly. The green boxes show that our method renders high-frequency details well.
  • Figure 4: Reconstruction Results on ScanNet scene 0054. The red box shows the bathroom which GO-SLAM fails to reconstruct while ours succeeds. The yellow, blue and orange boxes show that our method can produce more detailed reconstructions.
  • Figure 5: Comparison of Estimated Depth. We show the depth output $\tilde{D}$ from the tracker. Invalid (high-error) pixels are colored dark blue. DBA is the method used by Droid-SLAM [teed2021droid]. The DBA+mono prior strategy is used in HI-SLAM [zhang2023hi], i.e. the mono prior supervises all pixels directly. It is clear that our formulation (DSPO) provides the most consistent keyframe depth.
  • ...and 1 more figures
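
The proxy depth map described in the Figure 2 caption (tracker depth where it is reliable, monocular prior where observations are missing) can be sketched as follows. This is a minimal illustration under stated assumptions: the monocular depth is aligned to the tracker depth with a single scale and shift fitted by least squares on the valid pixels, and the fit is done directly in depth. The function name `proxy_depth`, the validity mask, and the choice of fitting in depth rather than disparity are hypothetical and may differ from the paper's formulation.

```python
import numpy as np

def proxy_depth(tracker_depth, mono_depth, valid):
    """Fill invalid tracker-depth pixels with a scale/shift-aligned mono prior.

    tracker_depth : (H, W) noisy keyframe depth from the tracker
    mono_depth    : (H, W) monocular depth prediction (unknown scale and shift)
    valid         : (H, W) boolean mask of pixels where tracker_depth is trusted
    """
    # Fit tracker ~= scale * mono + shift using only the valid pixels.
    x = mono_depth[valid]
    y = tracker_depth[valid]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)

    # Keep tracker depth where valid, use the aligned prior elsewhere.
    aligned = scale * mono_depth + shift
    return np.where(valid, tracker_depth, aligned), scale, shift

# Hypothetical usage: one pixel of the tracker depth is missing (marked invalid).
mono = np.array([[1.0, 2.0], [3.0, 4.0]])
tracker = np.array([[2.5, 4.5], [6.5, 0.0]])
mask = np.array([[True, True], [True, False]])
proxy, s, b = proxy_depth(tracker, mono, mask)
```

In this sketch the prior only fills holes, so it never overrides pixels the tracker already estimates well. The Figure 5 caption contrasts this with supervising all pixels by the monocular prior directly.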