D$^3$FlowSLAM: Self-Supervised Dynamic SLAM with Flow Motion Decomposition and DINO Guidance

Xingyuan Yu, Weicai Ye, Xiyue Guo, Yuhang Ming, Jinyu Li, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

TL;DR

D$^3$FlowSLAM targets robust dense SLAM in highly dynamic scenes. It decomposes optical flow into static and dynamic components through a dual-flow representation and updates pose, depth, and motion with a ConvGRU-based dynamic update module. Building on DROID-SLAM, it adds a dense bundle adjustment layer and a self-supervised objective that combines DINO-based foreground priors, artificial masks, and flow-guided losses, which enables label-free training. The method outperforms or is competitive with self-supervised baselines and, in some cases, matches supervised methods across several dynamic datasets (e.g., VKITTI2, KITTI, TUM-RGBD), while remaining end-to-end differentiable. The approach is relevant to AR and robotics applications in dynamic real-world environments, though it requires GPUs and careful hyperparameter tuning and may not reach real-time performance on all sequences.

Abstract

In this paper, we introduce a self-supervised deep SLAM method that robustly operates in dynamic scenes while accurately identifying dynamic components. Our method leverages a dual-flow representation for static flow and dynamic flow, facilitating effective scene decomposition in dynamic environments. We propose a dynamic update module based on this representation and develop a dense SLAM system that excels in dynamic scenarios. In addition, we design a self-supervised training scheme using DINO as a prior, enabling label-free training. Our method achieves superior accuracy compared to other self-supervised methods. It also matches or even surpasses the performance of existing supervised methods in some cases. All code and data will be made publicly available upon acceptance.
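
To make the dual-flow idea concrete, here is a minimal NumPy sketch of one common way to split optical flow into a camera-induced (static) part and a residual (dynamic) part. It assumes a pinhole camera and computes static flow by reprojecting pixels with inverse depth and a relative pose; this summary does not spell out the paper's exact parameterization, so the function names (`static_flow`, `decompose`), the intrinsics `K`, and the toy scene are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def static_flow(inv_depth, K, R, t):
    """Flow induced by camera motion alone (pinhole model, an assumption).

    inv_depth: (H, W) inverse depth in frame i
    K: (3, 3) intrinsics; R (3, 3), t (3,): transform from frame i to frame j
    """
    H, W = inv_depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels, (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                    # back-projected rays with z = 1
    pts_i = rays / inv_depth[..., None]                # 3D points in frame i
    pts_j = pts_i @ R.T + t                            # same points in frame j
    proj = pts_j @ K.T
    uv_j = proj[..., :2] / proj[..., 2:3]              # perspective division
    return uv_j - pix[..., :2]

def decompose(optical_flow, inv_depth, K, R, t):
    """Split optical flow so that optical_flow = static + dynamic."""
    f_s = static_flow(inv_depth, K, R, t)
    f_d = optical_flow - f_s                           # what camera motion cannot explain
    return f_s, f_d

# Toy scene (hypothetical values): flat wall at depth 2, camera shift of 0.1 along x.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
inv_depth = np.full((48, 64), 0.5)
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])

flow = static_flow(inv_depth, K, R, t)
flow[10:20, 10:20] += np.array([3.0, -2.0])            # an independently moving patch
f_s, f_d = decompose(flow, inv_depth, K, R, t)
```

In this toy case the static flow is a uniform shift of 5 pixels along x (focal length 100 times translation 0.1 divided by depth 2), and the dynamic flow is nonzero only inside the moving patch.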

Paper Structure (38 sections, 20 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: D$^3$FlowSLAM Overview. D$^3$FlowSLAM takes an image sequence as input, extracts features to construct a correlation volume, and then combines this with the initial static flow, dynamic flow, and dynamic mask before feeding it into the dynamic update module. This module iteratively optimizes the residuals of pose, inverse depth, static flow, dynamic flow, and dynamic mask, ultimately providing estimates of the camera pose, 3D structure, and dynamic decomposition results.
  • Figure 2: Dynamic Update Module. The correlation feature volume, together with the current optical flow $\mathbf{F_{o_t}}$, static flow $\mathbf{F_{s_t}}$, and dynamic mask $\mathbf{M_{d_t}}$, is fed to a ConvGRU whose output is split into several heads. Adding the dynamic mask residual $\mathbf{{\Delta M}_d}$ to the current mask yields the new dynamic mask $\mathbf{M_{d_{t+1}}}$, which is guided by both the DINO prior and an artificial prior created through the dual-flow representation. The static flow residual $\mathbf{r_s}$ is added to the current static flow and passed to the dense bundle adjustment (DBA) layer, which combines $\mathbf{M_{d_{t+1}}}$ with the confidence weight to optimize depth and pose. Finally, the new static and dynamic flows are summed to obtain the optical flow.
  • Figure 3: Reconstruction visualizations of our method. Our method can generalize to different datasets.
  • Figure 4: Flow decomposition visualization. From left to right: Input image, dynamic flow, and static flow.
  • Figure 5: Ablation for DINO Guidance. From left to right: Input image, prediction with DINO, and prediction without DINO. Colorful masks indicate dynamic parts.
  • ...and 4 more figures
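
The bookkeeping described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's module: the residuals are passed in as plain arrays standing in for ConvGRU outputs, the DBA solve itself is omitted, and several details are assumptions made here for the example, namely the name `r_d` for the dynamic flow residual, clipping the mask to [0, 1], and forming the DBA weight as confidence times (1 - mask).

```python
import numpy as np

def apply_update(f_s, f_d, m_d, conf, r_s, r_d, delta_m):
    """Apply one set of residuals to the dual-flow state.

    f_s, f_d, r_s, r_d, conf: (H, W, 2); m_d, delta_m: (H, W)
    """
    # New dynamic mask: current mask plus its residual (clipping is an assumption).
    m_next = np.clip(m_d + delta_m, 0.0, 1.0)
    # Static flow plus residual: the target handed to the DBA layer.
    f_s_target = f_s + r_s
    # Dynamic flow plus its residual (the name r_d is hypothetical).
    f_d_next = f_d + r_d
    # Assumed weighting: pixels marked dynamic contribute less to pose and depth.
    w_dba = conf * (1.0 - m_next)[..., None]
    # Optical flow is the sum of the static and dynamic parts.
    f_o_next = f_s_target + f_d_next
    return f_s_target, f_d_next, m_next, w_dba, f_o_next

# Toy 4x4 state with hypothetical values.
H, W = 4, 4
f_s = np.zeros((H, W, 2))
f_d = np.zeros((H, W, 2))
m_d = np.full((H, W), 0.2)
conf = np.ones((H, W, 2))
r_s = np.full((H, W, 2), 1.0)
r_d = np.full((H, W, 2), 0.5)
delta_m = np.zeros((H, W))
delta_m[0, 0] = 0.9                      # one pixel becomes strongly dynamic

f_s_target, f_d_next, m_next, w_dba, f_o_next = apply_update(
    f_s, f_d, m_d, conf, r_s, r_d, delta_m)
```

In the full system the static flow after the DBA layer would be the flow induced by the optimized pose and depth; here `f_s_target` stands in for it so that the final summation step can be shown.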