SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors

Zhen Hong, Bowen Wang, Haoran Duan, Yawen Huang, Xiong Li, Zhenyu Wen, Xiang Wu, Wei Xiang, Yefeng Zheng

TL;DR

SP-SLAM tackles real-time dense SLAM by injecting scene priors into a neural implicit framework. It encodes depth-derived priors into a sparse voxel volume and stores appearance on tri-planes, enabling rapid convergence and high-fidelity geometry and texture without relying on keyframes. A pixel-database-driven optimization continuously refines the poses of all frames during mapping, achieving accurate tracking with fewer iterations and real-time performance. Across five benchmark datasets, SP-SLAM demonstrates better tracking accuracy and reconstruction quality and runs significantly faster than existing neural SLAM methods, highlighting its practical value for real-time robotics and AR/VR applications.
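To make the pixel-database idea concrete, here is a minimal sketch under stated assumptions. The names `PixelRecord`, `PixelDatabase`, `add_frame`, and `sample_batch` are hypothetical and are not taken from the SP-SLAM code; uniform random sampling is also an assumption, since this summary does not describe the sampling scheme. The sketch only illustrates why a fixed-size batch drawn from every past frame can keep per-iteration cost flat as the sequence grows.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PixelRecord:
    """One stored pixel: which frame it came from and what was observed."""
    frame_id: int
    u: int  # column index
    v: int  # row index
    rgb: Tuple[float, float, float]
    depth: float


@dataclass
class PixelDatabase:
    """Hypothetical store of sparsely sampled pixels from every input frame."""
    pixels_per_frame: int
    records: List[PixelRecord] = field(default_factory=list)

    def add_frame(self, frame_id, rgb_image, depth_image, rng):
        """Keep a small random subset of pixels from a new frame."""
        height, width = len(depth_image), len(depth_image[0])
        for _ in range(self.pixels_per_frame):
            v = rng.randrange(height)
            u = rng.randrange(width)
            self.records.append(
                PixelRecord(frame_id, u, v, rgb_image[v][u], depth_image[v][u])
            )

    def sample_batch(self, batch_size, rng):
        """Draw a fixed-size batch across all frames seen so far.

        Each record carries its frame_id, so a rendering loss on the batch
        can send gradients to the pose of every frame that contributed.
        """
        k = min(batch_size, len(self.records))
        return rng.sample(self.records, k)
```

In this sketch the batch size, not the number of frames, bounds the work per mapping iteration, while storage still grows with the sequence length.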

Abstract

Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works fall short in reconstruction quality and real-time performance, mainly because they use an inflexible scene representation strategy that does not leverage any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system that performs tracking and mapping in real time. SP-SLAM computes depth images and establishes sparse voxel-encoded scene priors near the surfaces to achieve rapid convergence of the model. Subsequently, the encoding voxels computed from a single-frame depth image are fused into a global volume, which facilitates high-fidelity surface reconstruction. Simultaneously, we employ tri-planes to store scene appearance information, striking a balance between achieving high-quality geometric texture mapping and minimizing memory consumption. Furthermore, in SP-SLAM, we introduce an effective optimization strategy for mapping, allowing the system to continuously optimize the poses of all historical input frames during runtime without increasing computational overhead. We conduct extensive evaluations on five benchmark datasets (Replica, ScanNet, TUM RGB-D, Synthetic RGB-D, 7-Scenes). The results demonstrate that, compared to existing methods, we achieve superior tracking accuracy and reconstruction quality, while running at a significantly faster speed.
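The tri-plane appearance storage mentioned in the abstract can be illustrated with a small sketch. This is a generic tri-plane lookup, not the authors' implementation: the function names `bilinear` and `triplane_feature`, the unit-cube coordinate convention, and the choice to sum (rather than concatenate) the three plane features are all assumptions. The memory argument is general: three 2D planes of resolution R hold 3·R² feature cells, whereas a dense 3D grid of the same resolution holds R³.

```python
from typing import Dict, List, Sequence

# A plane is indexed as plane[row][col][channel].
Plane = List[List[List[float]]]


def bilinear(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature plane at normalized (u, v) in [0, 1].

    u selects the column and v selects the row.
    """
    rows, cols = len(plane), len(plane[0])
    # Clamp to the plane and map to continuous grid indices.
    x = min(max(u, 0.0), 1.0) * (cols - 1)
    y = min(max(v, 0.0), 1.0) * (rows - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - tx) * (1 - ty) * plane[y0][x0][c]
        + tx * (1 - ty) * plane[y0][x1][c]
        + (1 - tx) * ty * plane[y1][x0][c]
        + tx * ty * plane[y1][x1][c]
        for c in range(channels)
    ]


def triplane_feature(planes: Dict[str, Plane], point: Sequence[float]) -> List[float]:
    """Project a 3D point (in the unit cube) onto the xy, xz, and yz planes
    and sum the interpolated features (summation is an assumption here)."""
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
```

In a full system the resulting feature vector would be passed to a shallow decoder to predict color; that decoder is omitted here.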

Paper Structure (22 sections, 12 equations, 9 figures, 9 tables)


Figures (9)

  • Figure 1: The impact of mapping optimization strategies on our system. The green trajectory represents the ground truth camera motion, while the red trajectory represents the estimated camera motion. Compared to selecting a set of keyframes to maintain the scene map, optimizing each input frame using sparsely sampled pixels can achieve more robust camera tracking and more realistic scene reconstruction.
  • Figure 2: Overview of SP-SLAM. The depth encoder extracts local geometric priors from the depth image and fuses them into a global sparse volume. Our hybrid scene representation consists of sparse volumes representing geometry, three planes representing appearance, and two shallow MLP decoders. We calculate the rays emitted from the camera and sample them layer by layer based on the estimated camera pose, and then predict the color and TSDF of each sampling point through our scene representation. Volume rendering predicts the color and depth of rays (see the rendering section). The overall objective function consists of re-rendering losses and geometric losses (see the optimization section). The tracking process optimizes the camera pose of the current frame by minimizing the overall objective function, while the mapping process jointly optimizes the camera pose and scene representation of the selected frame.
  • Figure 3: Qualitative comparison in reconstruction on the Replica dataset. The region highlighted by the green rectangle showcases the higher fidelity of our geometry, and the region highlighted by the red rectangle demonstrates that our method is capable of generating smoother surfaces.
  • Figure 4: Qualitative comparison in reconstruction on the Synthetic RGB-D dataset. Our method can produce clearer and more detailed geometric structures.
  • Figure 5: Qualitative comparison in reconstruction quality on the ScanNet dataset. The ground truth mesh for ScanNet is obtained through BundleFusion. Compared to existing methods, our method generates smoother scene surfaces and more detailed geometric structures.
  • ...and 4 more figures
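
The Figure 2 caption describes predicting a color and TSDF value at each sample along a ray and then volume rendering the ray's color and depth. The sketch below shows one way such a step can work, under explicit assumptions: the bell-shaped weight sigmoid(s/tr)·sigmoid(−s/tr) is a weighting used in some TSDF-based neural reconstruction work and is not confirmed by this summary to be SP-SLAM's formula, and the names `render_ray` and `truncation` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def render_ray(
    depths: Sequence[float],
    tsdf: Sequence[float],
    colors: Sequence[Sequence[float]],
    truncation: float,
) -> Tuple[float, List[float]]:
    """Render depth and color for one ray from per-sample TSDF predictions.

    The weight of each sample peaks where the TSDF crosses zero, i.e. at
    the surface (assumed weighting, see the text above).
    """
    weights = [sigmoid(s / truncation) * sigmoid(-s / truncation) for s in tsdf]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all sample weights underflowed to zero")
    weights = [w / total for w in weights]  # normalize so weights sum to 1
    depth = sum(w * d for w, d in zip(weights, depths))
    color = [sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3)]
    return depth, color
```

With samples placed symmetrically around a surface, the rendered depth lands on the zero crossing, and the rendered color is dominated by the sample nearest the surface. A re-rendering loss would then compare these values against the observed pixel color and depth.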