Dynamic Visual SLAM using a General 3D Prior

Xingguang Zhong, Liren Jin, Marija Popović, Jens Behley, Cyrill Stachniss

TL;DR

Moving objects degrade camera tracking in monocular SLAM on real-world scenes. The authors combine patch-based bundle adjustment (BA) with a feed-forward 3D reconstruction model ($\pi^3_{\text{mos}}$) that filters out dynamic regions and predicts depth. The predicted depth is scale-aligned against the BA estimates and then integrated as a prior through an uncertainty-aware BA. The approach achieves robust camera tracking and scale-consistent depth across dynamic sequences, outperforming online baselines and approaching offline methods. Its contribution to online dynamic SLAM is the combination of a learned 3D prior with uncertainty weighting inside the geometric optimization.

Abstract

Reliable incremental estimation of camera poses and 3D reconstruction is key to enabling applications such as robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly handle the inherent scale ambiguities of the batch-wise application of the feed-forward reconstruction model.
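
The scale alignment mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' patch registration procedure: it assumes a single global scale factor and estimates it as the median ratio between BA patch depths and the predicted depth sampled at the same pixels. The function name `align_depth_scale` and its parameters are hypothetical.

```python
import numpy as np

def align_depth_scale(pred_depth, patch_uv, patch_depth):
    """Estimate a scalar s so that s * pred_depth matches sparse BA patch depths.

    Assumption (not from the paper): one global scale per depth map,
    estimated robustly as the median of per-patch depth ratios.

    pred_depth  : (H, W) depth map predicted by the feed-forward model
    patch_uv    : (N, 2) integer pixel coordinates (u, v) of patch centers
    patch_depth : (N,) patch depths estimated by bundle adjustment
    """
    u = patch_uv[:, 0].astype(int)
    v = patch_uv[:, 1].astype(int)
    sampled = pred_depth[v, u]  # predicted depth at each patch center

    # Ignore patches where either depth is missing or non-positive.
    valid = (sampled > 0) & (patch_depth > 0)
    if not np.any(valid):
        return 1.0  # nothing to align against; leave the prediction unscaled

    ratios = patch_depth[valid] / sampled[valid]
    return float(np.median(ratios))  # the median tolerates a few outlier patches
```

If the prediction reads 2.0 at every patch location while BA estimates a depth of 4.0 for those patches, the function returns 2.0, and a single outlier patch does not change the result. The median is one simple robust choice; a weighted or least-squares fit would be an equally reasonable stand-in.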

Paper Structure

This paper contains 15 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Given an input image sequence, our SLAM system, which tightly integrates a conventional geometric pipeline with a feed-forward reconstruction model, incrementally performs moving object segmentation (red masks), scale-consistent depth estimation, and camera pose estimation.
  • Figure 2: Network architecture of $\pi^3_{\text{mos}}$. After patchifying input images into tokens using DINOv2, we apply alternating frame-wise and global self-attention. The fused tokens are fed into three prediction heads: a depth head for estimating depth, a camera head for predicting relative poses, and a moving object segmentation (MOS) head for producing motion probabilities. We highlight the MOS head as the key addition compared to $\pi^3$.
  • Figure 3: Overview of our SLAM system. We highlight the current frame with a red outline. It is processed together with a set of selected historical frames by our $\pi^3_{\text{mos}}$ model, which outputs per-frame motion probabilities and depth predictions. We use the current frame’s motion probabilities to initialize new patches only on the static background. The depth maps of the historical frames are used to align the scale of the current prediction via patch registration. After scale alignment, the refined depth and the optical flow constraints are fed into our BA module to jointly optimize camera poses and patch depths.
  • Figure 4: Qualitative comparison of moving object segmentation on the DAVIS-16 dataset. We show the moving object masks predicted by Easi3R$_{\text{MonST3R}}$ and by our $\pi^3_{\text{mos}}$ model. Our method produces more accurate segmentation results, particularly in challenging scenarios with motion blur and thin structures.
  • Figure 5: Qualitative results on different datasets. We show the sparse patch point cloud map, the dense point cloud map accumulated from aligned depth predicted by $\pi^3_{\text{mos}}$, and the input data containing various dynamic objects.
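
The Figure 3 caption states that new patches are initialized only on the static background, using the current frame's motion probabilities. A minimal sketch of that idea follows. The threshold of 0.5, the uniform random sampling, and the name `sample_static_patches` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sample_static_patches(motion_prob, num_patches, threshold=0.5, rng=None):
    """Pick patch centers only where the motion probability is low.

    Assumptions (not from the paper): a fixed probability threshold and
    uniform random sampling over the pixels classified as static.

    motion_prob : (H, W) per-pixel motion probability in [0, 1]
    num_patches : maximum number of patch centers to return
    Returns an (K, 2) integer array of (u, v) pixel coordinates, K <= num_patches.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Row (v) and column (u) indices of pixels considered static.
    static_v, static_u = np.nonzero(motion_prob < threshold)
    if static_v.size == 0:
        return np.empty((0, 2), dtype=int)  # the whole frame looks dynamic

    k = min(num_patches, static_v.size)
    idx = rng.choice(static_v.size, size=k, replace=False)
    return np.stack([static_u[idx], static_v[idx]], axis=1)
```

Every returned coordinate lies on a pixel whose motion probability is below the threshold, so patches never start on a region flagged as moving. A real system would likely also account for image gradients or spatial coverage when choosing patch locations.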