DVN-SLAM: Dynamic Visual Neural SLAM Based on Local-Global Encoding

Wenhua Wu, Guangming Wang, Ting Deng, Sebastian Aegidius, Stuart Shanks, Valerio Modugno, Dimitrios Kanoulas, Hesheng Wang

TL;DR

DVN-SLAM addresses robustness to dynamic objects in dense NeRF-based SLAM by introducing a local-global fusion neural implicit representation that combines a global One-Blob encoding with local axis-aligned feature planes. It uses attention-based feature fusion and a result fusion scheme to predict RGB and TSDF values, and an information concentration loss based on depth variance to reduce uncertainty in the rendering process. The approach achieves competitive localization and mapping in static scenes and remains robust in highly dynamic indoor scenes, with evaluation on Replica and TUM-RGBD, where it outperforms several baselines while running in real time on an A100 GPU. This work advances dense SLAM by enabling plausible reconstructions of unobserved regions and robust operation under object motion, with potential impact on autonomous systems and AR/VR in dynamic environments.
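To make the global encoding mentioned above more concrete, here is a minimal sketch of a One-Blob style encoding in plain Python. It is a simplified variant that evaluates a Gaussian kernel at bin centers; the bin count, the kernel width, and the function names `one_blob` and `encode_point` are assumptions for illustration and are not taken from the paper. The local feature planes and the attention-based fusion are not shown.

```python
import math


def one_blob(x, num_bins=16):
    """Encode a scalar x in [0, 1] as num_bins Gaussian kernel activations.

    Assumption: the kernel width equals the bin width (sigma = 1 / num_bins).
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    sigma = 1.0 / num_bins
    # Bin centers are spread evenly over the unit interval.
    centers = [(i + 0.5) / num_bins for i in range(num_bins)]
    # Each entry measures how close x is to the corresponding bin center.
    return [math.exp(-0.5 * ((c - x) / sigma) ** 2) for c in centers]


def encode_point(point, num_bins=16):
    """Concatenate per-axis encodings of a point normalized to the unit cube."""
    features = []
    for coord in point:
        features.extend(one_blob(coord, num_bins))
    return features


if __name__ == "__main__":
    feats = encode_point((0.1, 0.5, 0.9), num_bins=16)
    print(len(feats))  # three axes times 16 bins
```

Because every coordinate activates several neighboring bins smoothly, nearby points receive similar feature vectors, which is one reason such an encoding suits the coarse, global part of a scene representation.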

Abstract

Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, there are still some challenges: the limited scene representation capability of implicit encodings, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a real-time dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness in dynamic scenes, a trait that sets it apart from other NeRF-based methods.

Paper Structure (21 sections, 17 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: We propose DVN-SLAM, a real-time, dynamic-robust dense visual SLAM system based on local-global fusion neural implicit representation. Compared to current NeRF-based SLAM methods, such as iMAP [sucar2021imap], NICE-SLAM [zhu2022nice], Vox-Fusion [yang2022vox], ESLAM [johari2023eslam], and Co-SLAM [wang2023co], DVN-SLAM not only achieves competitive performance in static scenes but also remains effective in highly dynamic scenes. Panel (a) shows the input dynamic scene video stream. Panel (b) shows the localization and mapping results of DVN-SLAM. Panel (c) shows the rendering visualization during the mapping process, demonstrating the successful removal of dynamic humans and background completion.
  • Figure 2: Overview of DVN-SLAM. The left side of the framework is the explicit perception module, which takes RGB-D video streams as input. Whether a frame is added as a keyframe is decided by the information gain from the image. The right side of the framework is the local-global fusion neural implicit representation, which establishes a mapping from spatial locations to color and TSDF values. Volume rendering is performed to generate RGB-D images corresponding to the tracking poses. In the middle is our loss optimization module, which includes color loss, depth loss, TSDF loss, and information concentration loss.
  • Figure 3: Local-global fusion neural implicit representation. The blue region represents the global representation, while the green region represents the local representation. We employ two methods, shown in red, to merge the global and local representations: attention-based feature fusion and result fusion. Together they achieve stronger representational capacity and a more stable neural implicit representation.
  • Figure 4: Illustration of Volume Rendering. Different information distributions along the same ray may yield the same rendering result. This introduces uncertainty in the distribution of scene information even when the rendering result is determined.
  • Figure 5: Reconstruction results of room0. The chair region is magnified for emphasis. Compared to other methods, DVN-SLAM achieves accurate modeling of both local details and global structure, and provides more reasonable predictions for unobserved regions, such as the back of the chair shown in the figure.
  • ...and 9 more figures
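
The ambiguity described in the Figure 4 caption can be illustrated with a small numerical sketch. Assuming the usual volume rendering form in which rendered depth is a weighted sum of sample depths along a ray, two different weight distributions can produce the same rendered depth while differing in how spread out they are. The depth variance computed below is a generic quantity chosen for illustration; the exact form of the paper's information concentration loss is not given in this section, and the function names are hypothetical.

```python
def render_depth(weights, depths):
    """Rendered depth as the weighted sum of sample depths along one ray."""
    return sum(w * z for w, z in zip(weights, depths))


def depth_variance(weights, depths):
    """Weighted variance of sample depths around the rendered depth."""
    d = render_depth(weights, depths)
    return sum(w * (z - d) ** 2 for w, z in zip(weights, depths))


# Three samples along a single ray (depths in arbitrary units).
depths = [1.0, 2.0, 3.0]

# Case A: all weight sits on the middle sample (a sharp surface).
concentrated = [0.0, 1.0, 0.0]
# Case B: weight is split between the nearest and farthest samples.
spread = [0.5, 0.0, 0.5]

if __name__ == "__main__":
    for name, w in (("concentrated", concentrated), ("spread", spread)):
        print(name, render_depth(w, depths), depth_variance(w, depths))
```

Both cases render the same depth, so a depth loss alone cannot tell them apart. The variance is zero only for the concentrated case, which is why penalizing a variance-like term pushes scene information toward object surfaces.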