DROID-SLAM in the Wild

Moyang Li, Zihan Zhu, Marc Pollefeys, Daniel Barath

Abstract

We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.

Paper Structure (35 sections, 15 equations, 10 figures, 10 tables)

This paper contains 35 sections, 15 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: DROID-W. Given a casually captured in-the-wild video, our method estimates accurate dynamic uncertainty, camera trajectory, and scene structure, where existing SLAM baselines fail. Left: frames of the input video. Middle: reconstructed dynamic point clouds with estimated camera poses. Right: overlay of optimized uncertainty on the corresponding input frames.
  • Figure 2: System Overview. DROID-W takes a sequence of RGB images as input and jointly estimates camera poses and scene geometry. It alternates between pose-depth refinement and uncertainty optimization. The uncertainty-aware dense bundle adjustment weights reprojection residuals by the per-pixel uncertainty $\mathbf{u}$ to mitigate the influence of dynamic distractors. In addition, predicted monocular depth $\mathbf{D}$ regularizes the bundle adjustment, improving its robustness in highly dynamic environments. The uncertainty optimization module first extracts DINOv2 [oquab2023dinov2] features from the input images and then iteratively updates the dynamic uncertainty map using multi-view feature consistency. Specifically, feature consistency is measured by the cosine similarity between the features of image $\mathbf{I}_i$ and the corresponding features in image $\mathbf{I}_j$, where the rigid-motion correspondences $\mathbf{p}_{ij}$ are derived from the current pose and depth estimates.
  • Figure 3: Uncertainty Estimation. WildGS-SLAM [zheng2025wildgs] and our approach estimate dynamic uncertainty, whereas MonST3R [zhang2024monst3r] predicts a binary motion mask. Our approach produces more accurate and spatially consistent uncertainty estimates across all challenging sequences.
  • Figure 4: 3D Reconstruction Comparisons on YouTube Sequences. We compare the 3D reconstruction quality of DROID-SLAM [teed2021droid], WildGS-SLAM [zheng2025wildgs], and our method. Point clouds from DROID-SLAM and ours are visualized directly, while Gaussian renderings from WildGS-SLAM are displayed using the 3DGS viewer. WildGS-SLAM fails on most sequences. DROID-SLAM shows obvious scale drift (St. Moritz 1), inaccurate geometry (St. Moritz 3), and noisy distractors (Tokyo Walking 2 & 3) under challenging dynamic environments. Our approach produces accurate and consistent reconstructions across highly dynamic and visually challenging real-world sequences.
  • Figure 5: Uncertainty Estimation.
  • ...and 5 more figures
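
The Figure 2 caption describes two ingredients that a small example may make more concrete: feature consistency measured by cosine similarity between corresponding features, and reprojection residuals weighted by per-pixel uncertainty. The sketch below is a minimal pure-Python illustration under stated assumptions, not the authors' implementation. The function names, the linear mapping from inconsistency to uncertainty (with a hypothetical `gain`), and the 1/u² weighting are all assumptions made for illustration. The paper optimizes the uncertainty map iteratively, which is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no similarity evidence
    return dot / (norm_a * norm_b)


def feature_inconsistency(feats_i, feats_j_warped):
    """Per-pixel inconsistency, defined here as 1 - cosine similarity.

    feats_i:        features of image I_i, one vector per pixel.
    feats_j_warped: features of image I_j sampled at the rigid-motion
                    correspondences p_ij (assumed to be precomputed).
    """
    return [1.0 - cosine_similarity(f, g)
            for f, g in zip(feats_i, feats_j_warped)]


def uncertainty_from_inconsistency(inconsistency, gain=4.0):
    """Hypothetical closed-form stand-in for the iterative uncertainty update.

    Consistent pixels get u close to 1; inconsistent pixels get a larger u.
    """
    return [1.0 + gain * d for d in inconsistency]


def weighted_reprojection_cost(residuals, uncertainty):
    """Sum of squared 2D reprojection residuals, each divided by u^2.

    The 1/u^2 weighting is an assumption of this sketch; it only shows
    how a high uncertainty reduces a pixel's influence on the cost.
    """
    total = 0.0
    for (rx, ry), u in zip(residuals, uncertainty):
        total += (rx * rx + ry * ry) / (u * u)
    return total


# Two pixels: the first matches across views, the second does not.
feats_i = [[1.0, 0.0], [1.0, 0.0]]
feats_j_warped = [[2.0, 0.0], [0.0, 3.0]]
residuals = [(1.0, 0.0), (3.0, 4.0)]

inconsistency = feature_inconsistency(feats_i, feats_j_warped)
uncertainty = uncertainty_from_inconsistency(inconsistency)
cost = weighted_reprojection_cost(residuals, uncertainty)
```

In this toy setup, the second pixel has a large residual but also a large uncertainty, so its contribution to the cost is scaled down. That is the qualitative behavior the caption attributes to uncertainty-aware bundle adjustment.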