PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment

Tianchen Deng, Guole Shen, Tong Qin, Jianyu Wang, Wentao Zhao, Jingchuan Wang, Danwei Wang, Weidong Chen

TL;DR

PLGSLAM tackles scaling issues in neural implicit SLAM for large indoor environments by introducing progressive, locally scoped scene representations that expand capacity as the camera moves. It fuses a tri-plane high-frequency feature map with a one-blob encoded MLP for low-frequency coherence and employs differentiable rendering to supervise geometry and appearance. A novel local-to-global bundle adjustment, driven by a global keyframe database and neural warping/reprojection losses, mitigates cumulative pose drift over long sequences. Across Replica, ScanNet, and Apartment datasets, PLGSLAM achieves state-of-the-art surface reconstruction and tracking with real-time performance, while substantially reducing memory growth compared to cubic scaling.
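The one-blob encoding mentioned above can be illustrated with a minimal sketch. It assumes the usual formulation, in which each coordinate, normalized to [0, 1], is encoded as the activations of a Gaussian kernel evaluated at the centers of evenly spaced bins. The bin count, the kernel width `sigma = 1 / num_bins`, and the function names are illustrative assumptions, not values taken from the paper.

```python
import math


def one_blob(x, num_bins=16):
    """Encode a scalar x in [0, 1] as Gaussian activations over num_bins bins.

    Assumption: kernel width sigma = 1 / num_bins; bin centers are evenly spaced.
    """
    sigma = 1.0 / num_bins
    centers = [(i + 0.5) / num_bins for i in range(num_bins)]
    return [math.exp(-0.5 * ((x - c) / sigma) ** 2) for c in centers]


def encode_point(point, num_bins=16):
    """Concatenate the per-axis one-blob encodings of a normalized 3D point."""
    encoded = []
    for coord in point:
        encoded.extend(one_blob(coord, num_bins))
    return encoded


# A 3D point becomes a 3 * num_bins vector that an MLP could take as input.
features = encode_point((0.1, 0.5, 0.9))
```

Unlike a one-hot vector, neighboring bins receive non-zero activations, so the encoding varies smoothly with the coordinate, which suits a low-frequency, coherence-oriented branch.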

Abstract

Neural implicit scene representations have recently shown encouraging results in dense visual SLAM. However, existing methods produce low-quality scene reconstruction and inaccurate localization when scaling up to large indoor scenes and long sequences. These limitations are mainly due to their single, global radiance field with finite capacity, which does not adapt to large scenarios. Their end-to-end pose networks are also not robust enough as cumulative errors grow in large scenes. To this end, we introduce PLGSLAM, a neural visual SLAM system capable of high-fidelity surface reconstruction and robust camera tracking in real time. To handle large-scale indoor scenes, PLGSLAM proposes a progressive scene representation method which dynamically allocates a new local scene representation trained with frames within a local sliding window. This allows us to scale up to larger indoor scenes and improves robustness (even under pose drift). In the local scene representation, PLGSLAM uses tri-planes for local high-frequency features and multi-layer perceptron (MLP) networks for low-frequency features, achieving smoothness and scene completion in unobserved areas. Moreover, we propose a local-to-global bundle adjustment method with a global keyframe database to address the increased pose drift on long sequences. Experimental results demonstrate that PLGSLAM achieves state-of-the-art scene reconstruction and tracking performance across various datasets and scenarios (in both small and large-scale indoor environments). The code is open-sourced at https://github.com/dtc111111/plgslam.
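The progressive allocation described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the released implementation: the distance-based allocation rule, the `radius` and `window_size` parameters, and the class names `LocalMap` and `ProgressiveScene` are all assumptions made for the example.

```python
import math
from collections import deque


class LocalMap:
    """One local scene representation with its own sliding window of frames."""

    def __init__(self, center, window_size):
        self.center = center
        # Only the most recent window_size frame ids are kept for training.
        self.window = deque(maxlen=window_size)


class ProgressiveScene:
    """Allocates a new local map when the camera leaves the active one.

    Assumption: 'leaving' means moving farther than `radius` from the
    active map's center; the paper's actual criterion may differ.
    """

    def __init__(self, radius=2.0, window_size=10):
        self.radius = radius
        self.window_size = window_size
        self.maps = []

    def add_frame(self, frame_id, cam_pos):
        needs_new_map = (
            not self.maps
            or math.dist(cam_pos, self.maps[-1].center) > self.radius
        )
        if needs_new_map:
            self.maps.append(LocalMap(tuple(cam_pos), self.window_size))
        active = self.maps[-1]
        active.window.append(frame_id)
        return active


# Usage: a camera moving along the x-axis in 0.5 m steps.
scene = ProgressiveScene(radius=2.0, window_size=3)
for i in range(8):
    scene.add_frame(i, (0.5 * i, 0.0, 0.0))
```

In this sketch, capacity grows with the number of local maps visited by the camera rather than with the volume of a single global grid, which is the intuition behind the scaling claim.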

Paper Structure (13 sections, 13 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Large-scale indoor scene 3D Reconstruction with different methods. We depict the final mesh and camera tracking trajectory error (Absolute Trajectory Error) of different methods. The color bar on the right shows the relative scaling of color. PLGSLAM outperforms others in both scene reconstruction and pose estimation.
  • Figure 2: The isometric view of the proposed PLGSLAM system. Our system has two parallel threads: the mapping thread and the tracking thread. In the mapping thread, we propose the progressive scene representation method for the entire scene. In the local scene representation, we combine the tri-planes with the multi-layer perceptron to improve accuracy and smoothness. Both are updated online, while the system operates, by minimizing our carefully designed loss through differentiable rendering. In the tracking thread, we propose a local-to-global bundle adjustment for accurate and robust pose estimation. The two threads run with alternating optimization.
  • Figure 3: This figure illustrates the designed neural warping loss. We calculate the neural warping loss between keyframe $I$ and keyframe $I'$.
  • Figure 4: Reconstruction results (without cull) on the Replica apartment dataset. In comparison to our baselines, our method achieves accurate and high-quality scene reconstruction and completion on various scenes. Outlined regions are marked in red to signify lower predictive accuracy, in green to signify higher accuracy, and in yellow to represent the ground truth results. The number in the bottom right corner of each image is the completion ratio metric.
  • Figure 5: Qualitative comparison of our proposed PLGSLAM method’s surface reconstruction and localization accuracy with existing NeRF-based dense visual SLAM methods (NICE-SLAM, Co-SLAM, and ESLAM) on the ScanNet dataset. The ground truth camera trajectory is shown in blue, and the estimated trajectory is shown in red. Our method predicts more accurate camera trajectories and does not suffer from drifting issues. We also visualize the Absolute Trajectory Error (ATE) of the different methods (bottom color bar). The color bar on the right shows the relative scaling of color. It should also be noted that our method runs faster on this dataset.
  • ...and 1 more figures
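
The Absolute Trajectory Error (ATE) referenced in the captions is commonly reported as the root-mean-square of per-frame translational errors. The sketch below assumes that the estimated and ground-truth trajectories are already time-synchronized and expressed in the same coordinate frame; the rigid alignment step that evaluation tools usually perform first is omitted, and the function name `ate_rmse` is hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of translational error between two aligned position sequences.

    Assumption: both inputs are equal-length lists of (x, y, z) positions
    that are already synchronized and aligned to a common frame.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        math.dist(est, gt) ** 2 for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Usage: two poses, the second one off by 5 units.
error = ate_rmse([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 3, 4)])
```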