DDN-SLAM: Real-time Dense Dynamic Neural Implicit SLAM

Mingrui Li, Yiming Zhou, Guangan Jiang, Tianchen Deng, Yangyang Wang, Hongyu Wang

TL;DR

DDN-SLAM addresses the core challenge of dynamic interference in neural implicit SLAM by integrating semantic understanding with a Gaussian Mixture depth prior to differentiate dynamic, static, and potentially static regions. It introduces a three-part pipeline: (i) semantic-guided segmentation with a depth-based two-Gaussian model and EM updates for robust foreground/background labeling, (ii) mixed background restoration combining optical-flow-based inpainting with sparse-point–guided sampling to preserve static structure, and (iii) dynamic NeRF rendering with a dedicated loss that enforces motion consistency and minimizes occlusion artifacts. The approach achieves real-time performance (~20 Hz) on monocular, stereo, and RGB-D inputs and demonstrates strong results across dynamic and challenging indoor scenes, including a reported average 90% improvement in ATE over prior neural implicit SLAM methods. By preserving potential dynamic objects and constraining dynamic occlusions, DDN-SLAM provides more complete, high-fidelity reconstructions suitable for robotics and AR/VR applications in dynamic environments.
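
As a rough illustration of the depth-based two-Gaussian model with EM updates mentioned above, the sketch below fits a two-component 1-D Gaussian mixture to the depth values sampled inside a detection region and labels the nearer component as foreground. The function names (`fit_two_gaussians`, `foreground_mask`), the quartile initialisation, the 0.5 threshold, and the "nearer component is foreground" rule are all assumptions made for this example; they are not taken from the paper.

```python
import math

def fit_two_gaussians(depths, n_iter=50, eps=1e-6):
    """Fit a two-component 1-D Gaussian mixture to depth values with EM.

    Returns (means, variances, weights, responsibilities).
    """
    d = [float(x) for x in depths]
    n = len(d)
    s = sorted(d)
    # Assumed initialisation: means at the lower and upper quartiles.
    mu = [s[n // 4], s[(3 * n) // 4]]
    mean = sum(d) / n
    v0 = sum((x - mean) ** 2 for x in d) / n + eps
    var = [v0, v0]
    pi = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in d]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each depth.
        for i, x in enumerate(d):
            logp = [
                math.log(pi[k])
                - 0.5 * ((x - mu[k]) ** 2 / var[k] + math.log(2 * math.pi * var[k]))
                for k in range(2)
            ]
            m = max(logp)  # subtract the max for numerical stability
            w = [math.exp(lp - m) for lp in logp]
            z = w[0] + w[1]
            resp[i] = [w[0] / z, w[1] / z]
        # M-step: re-estimate mean, variance, and weight of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) + eps
            mu[k] = sum(r[k] * x for r, x in zip(resp, d)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, d)) / nk + eps
            pi[k] = nk / (n + 2 * eps)
    return mu, var, pi, resp

def foreground_mask(depths, threshold=0.5, **kwargs):
    """Label each depth as foreground (True) or background (False).

    Assumption: the component with the smaller mean depth is the foreground.
    """
    mu, _, _, resp = fit_two_gaussians(depths, **kwargs)
    fg = 0 if mu[0] <= mu[1] else 1
    return [r[fg] > threshold for r in resp]
```

Calling `foreground_mask(depths)` on the depth samples of one detection box returns one boolean per sample. In a full system the resulting labels would be combined with the semantic detections, which this sketch does not model.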

Abstract

SLAM systems based on NeRF have demonstrated superior performance in rendering quality and scene reconstruction for static environments compared to traditional dense SLAM. However, they encounter tracking drift and mapping errors in real-world scenarios with dynamic interferences. To address these issues, we introduce DDN-SLAM, the first real-time dense dynamic neural implicit SLAM system integrating semantic features. To address dynamic tracking interferences, we propose a feature point segmentation method that combines semantic features with a mixed Gaussian distribution model. To avoid incorrect background removal, we propose a mapping strategy based on sparse point cloud sampling and background restoration. We propose a dynamic semantic loss to eliminate dynamic occlusions. Experimental results demonstrate that DDN-SLAM is capable of robustly tracking and producing high-quality reconstructions in dynamic environments, while appropriately preserving potential dynamic objects. Compared to existing neural implicit SLAM systems, the tracking results on dynamic datasets indicate an average 90% improvement in Average Trajectory Error (ATE) accuracy.
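
To make the reported ATE figure concrete, the sketch below shows one common way such a number is obtained: ATE is usually reported as the RMSE of the translational differences between estimated and ground-truth camera positions, and an improvement percentage is the relative reduction against a baseline. It assumes the two trajectories are already time-associated and rigidly aligned (the alignment step is omitted), and the helper names `ate_rmse` and `relative_improvement` are hypothetical.

```python
import math

def ate_rmse(estimated, ground_truth):
    """RMSE of translational error between two trajectories.

    Assumption: both are equal-length sequences of (x, y, z) positions that
    are already time-associated and expressed in the same frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p, q))
        for p, q in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_improvement(baseline_ate, new_ate):
    """Percentage reduction of ATE relative to a baseline."""
    if baseline_ate <= 0:
        raise ValueError("baseline ATE must be positive")
    return 100.0 * (baseline_ate - new_ate) / baseline_ate
```

With purely hypothetical numbers, a baseline ATE of 0.5 m reduced to 0.05 m corresponds to a 90% improvement; these values illustrate the arithmetic only and are not results from the paper.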

Paper Structure (13 sections, 19 equations, 4 figures, 4 tables)

This paper contains 13 sections, 19 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: System overview. DDN-SLAM consists of two main modules, tracking and mapping, divided across four threads that optimize alternately. The segmentation thread detects and segments dynamic feature points and pixels, suppressing potential feature points. The tracking thread extracts feature points, tracks with the feature points that pass conditional filtering, obtains static optical flow, and generates keyframes and camera poses. The mapping thread receives background segmentation masks for high and low dynamics, and performs keyframe generation and volume rendering. The loop detection thread detects loops and performs global bundle adjustment. The system can be updated in real time.
  • Figure 2: Pixel-level clustering results from foreground depth probability verification, showing accurate segmentation of dynamic object masks. The top section shows the segmentation results of YOLOv9 [yolov9github]; the bottom section shows our clustering results.
  • Figure 3: Our method accurately segments multiple targets in high-dynamic scenes of the Bonn dataset [palazzolo2019refusion], where human bodies are enclosed in semantic bounding boxes for detection. We remove feature points that lie on dynamic human bodies and preserve static feature points within the bounding box. Green markers indicate the preserved feature points, and red markers indicate the recovered feature points.
  • Figure 4: Comparison of reconstruction results from the traditional LC-CRF SLAM [du2020accurate], ESLAM [yang2019cubeslam], Orbeez-SLAM [chung2023orbeezslam], and our method (in both RGB and RGB-D modes) on dynamic sequences from TUM RGB-D [sturm2012evaluating]. Results are shown for four TUM dynamic sequences: two high-dynamic sequences (significant, fast human movement) and two low-dynamic sequences (minimal, slower human movement). Our method preserves potential dynamic objects, eliminates occlusion interference, and reduces artifacts, achieving high-quality reconstruction. For a fair comparison, the ESLAM results are rendered from ground-truth poses to avoid reconstruction errors caused by tracking inaccuracies.