D$^2$GSLAM: 4D Dynamic Gaussian Splatting SLAM

Siting Zhu, Yuxiang Huang, Wenhua Wu, Chaokang Jiang, Yongbo Chen, I-Ming Chen, Hesheng Wang

TL;DR

D$^2$GSLAM introduces a Gaussian-based dynamic SLAM framework that jointly reconstructs static and dynamic scene parts and tracks the camera in dynamic environments. Key ideas include a geometric-prompt dynamic separation to generate robust motion masks, a dynamic-static composite map combining 3D static Gaussians with 4D dynamic Gaussians, retrospective frame optimization to maintain temporal coherence, and a motion-consistency loss to leverage temporal dynamics. The system employs a progressive tracking strategy and comprehensive losses to achieve accurate dynamic modeling, outperforming state-of-the-art baselines on Bonn, TUM, and static datasets in both tracking and reconstruction metrics. Although not real-time for dynamic modeling, the method demonstrates substantial improvements in dynamic scene understanding with practical runtime for motion segmentation and robust tracking in real-world indoor environments.

Abstract

Recent advances in dense Simultaneous Localization and Mapping (SLAM) have demonstrated remarkable performance in static environments. However, dense SLAM in dynamic environments remains challenging. Most methods simply remove dynamic objects and reconstruct only the static scene, which discards the motion information these objects carry. In this paper, we present D$^2$GSLAM, a novel dynamic SLAM system built on a Gaussian representation, which simultaneously performs accurate dynamic reconstruction and robust tracking in dynamic environments. Our system is composed of four key components: (i) We propose a geometric-prompt dynamic separation method to distinguish between static and dynamic elements of the scene. This approach leverages the geometric consistency of the Gaussian representation and the scene geometry to obtain coarse dynamic regions. These regions then serve as prompts that guide the refinement of the coarse mask into an accurate motion mask. (ii) To facilitate accurate and efficient mapping of the dynamic scene, we introduce a dynamic-static composite representation that integrates static 3D Gaussians with dynamic 4D Gaussians. This representation can model transitions of objects between static and dynamic states, enabling composite mapping and optimization. (iii) We employ a progressive pose refinement strategy that leverages both the multi-view consistency of static scene geometry and the motion information from dynamic objects to achieve accurate camera tracking. (iv) We introduce a motion consistency loss, which leverages the temporal continuity of object motion for accurate dynamic modeling. D$^2$GSLAM demonstrates superior mapping and tracking accuracy on dynamic scenes, while also showing the capability for accurate dynamic modeling.
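As a rough illustration of component (i), the sketch below derives a coarse dynamic region and a box prompt from a depth disagreement test. The abstract does not state the actual consistency criterion, so the depth-residual test, the relative threshold, and the function names `coarse_dynamic_mask` and `box_prompt` are all assumptions made for illustration, not the authors' implementation. The subsequent mask refinement step is not shown.

```python
import numpy as np

def coarse_dynamic_mask(rendered_depth, observed_depth, rel_thresh=0.1):
    """Flag pixels where the sensor depth disagrees with the depth rendered
    from the static map. ASSUMPTION: a relative depth residual stands in for
    the paper's geometric consistency check."""
    valid = (rendered_depth > 0) & (observed_depth > 0)  # ignore missing depth
    residual = np.abs(rendered_depth - observed_depth)
    return valid & (residual > rel_thresh * rendered_depth)

def box_prompt(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around the flagged
    pixels, or None when nothing is flagged. Hypothetical helper."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy frame: a flat wall at 2 m, with an object at 1 m in front of it.
rendered = np.full((6, 8), 2.0)   # depth rendered from the static map
observed = rendered.copy()        # depth measured by the sensor
observed[2:4, 3:6] = 1.0          # the moving object occludes the wall

mask = coarse_dynamic_mask(rendered, observed)
box = box_prompt(mask)
```

A box of this kind could then be handed to a promptable segmentation model to refine the coarse region into a per-pixel motion mask. Because the test relies only on geometry, it does not depend on the object's semantic category.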

Paper Structure

This paper contains 25 sections, 15 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of D$^2$GSLAM. Our method takes sequential RGB-D frames as input. RGB-D images are first fed into a dynamic detector to determine whether the current frame contains dynamic objects. The dynamic mask of the current frame is then produced by motion mask generation. In the tracking process, we perform static-only initialization and dynamic-static refinement, followed by dense bundle adjustment based on the generated motion masks, to obtain the estimated poses. Then, in dynamic-static composite mapping, we model the dynamic and static parts separately with dynamic 4D Gaussians and static 3D Gaussians. Concurrently, a dynamic replay buffer and mapping keyframes are maintained for retrospective frame optimization, enabling accurate dynamic mapping.
  • Figure 2: Geometric-prompt dynamic separation. The box prompt obtained through YOLO-World [cheng2024yolo] fails to detect the balloon, as 'balloon' is not included in its predefined semantic classes. By applying our proposed $R_g$ and $R_p$, we obtain complete motion masks for all dynamic objects in the scene. Our method segments dynamic objects by leveraging scene geometry, regardless of object category.
  • Figure 3: Visualization of our generated motion masks. Our method achieves accurate dynamic separation results in various dynamic scenes.
  • Figure 4: Qualitative comparison of scene reconstruction performance on the Bonn dataset. People are walking in the scenes. Our method achieves high-quality dynamic reconstruction in challenging dynamic scenarios, including complex scenes with multiple moving people. Other SLAM methods fail to perform mapping in such dynamic scenes.
  • Figure 5: Qualitative comparison of scene reconstruction performance. The person is hitting a balloon with rapid up-and-down motions in the scene. Our method achieves accurate dynamic reconstruction compared with other SLAM methods.
  • ...and 2 more figures