D$^3$FlowSLAM: Self-Supervised Dynamic SLAM with Flow Motion Decomposition and DINO Guidance
Xingyuan Yu, Weicai Ye, Xiyue Guo, Yuhang Ming, Jinyu Li, Hujun Bao, Zhaopeng Cui, Guofeng Zhang
TL;DR
D$^3$FlowSLAM tackles robust dense SLAM in highly dynamic scenes by decomposing optical flow into static and dynamic components via a dual-flow representation and updating pose, depth, and motion with a ConvGRU-based dynamic update module. It extends the DROID-SLAM foundation with a dense bundle adjustment layer and a self-supervised objective that leverages DINO-based foreground priors, artificial masks, and flow-guided losses to enable label-free training. The method achieves superior or competitive performance compared to self-supervised baselines and, in some cases, matches supervised methods across diverse dynamic datasets (e.g., VKITTI2, KITTI, TUM-RGBD) while maintaining end-to-end differentiability. This makes the approach practical for AR and robotics in real-world dynamic environments, though it requires a GPU and careful hyperparameter tuning, and it may not yet reach real-time performance on all sequences.
Abstract
In this paper, we introduce a self-supervised deep SLAM method that robustly operates in dynamic scenes while accurately identifying dynamic components. Our method leverages a dual-flow representation for static flow and dynamic flow, facilitating effective scene decomposition in dynamic environments. We propose a dynamic update module based on this representation and develop a dense SLAM system that excels in dynamic scenarios. In addition, we design a self-supervised training scheme using DINO as a prior, enabling label-free training. Our method achieves superior accuracy compared to other self-supervised methods. It also matches or even surpasses the performance of existing supervised methods in some cases. All code and data will be made publicly available upon acceptance.
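As a rough illustration of the dual-flow idea, and not the authors' implementation, the sketch below splits the observed optical flow at a single pixel into a static part induced by camera motion and a dynamic residual attributed to object motion. It assumes a pinhole camera, a known depth, and a known relative pose (R, t); in the method itself these quantities are estimated by the learned update module rather than given. The function names `static_flow` and `decompose_flow` and the intrinsics tuple `(fx, fy, cx, cy)` are hypothetical and chosen only for this example.

```python
from typing import Sequence, Tuple

Flow = Tuple[float, float]


def static_flow(
    u: float,
    v: float,
    depth: float,
    K: Tuple[float, float, float, float],
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> Flow:
    """Flow that camera motion alone would induce at pixel (u, v).

    Assumption: pinhole model with intrinsics K = (fx, fy, cx, cy), and
    (R, t) maps points from the first camera frame into the second.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the first camera frame.
    X = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Move the point into the second camera frame: X' = R X + t.
    Xp = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Project back to pixel coordinates in the second frame.
    u2 = fx * Xp[0] / Xp[2] + cx
    v2 = fy * Xp[1] / Xp[2] + cy
    return (u2 - u, v2 - v)


def decompose_flow(observed: Flow, static: Flow) -> Flow:
    """Dynamic flow as the residual: observed flow minus static flow."""
    return (observed[0] - static[0], observed[1] - static[1])


# Illustrative values only: camera translates along x, no rotation.
K = (500.0, 500.0, 320.0, 240.0)
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_shift = [-0.1, 0.0, 0.0]

f_static = static_flow(420.0, 240.0, 5.0, K, R_identity, t_shift)
f_dynamic = decompose_flow((-4.0, 0.0), f_static)
```

For a pixel on a static surface, the residual is close to zero, so a large dynamic component is one plausible cue that the pixel belongs to a moving object.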
