
Dropping the D: RGB-D SLAM Without the Depth Sensor

Mert Kiray, Alican Karaomer, Benjamin Busam

TL;DR

DropD-SLAM is a real-time monocular SLAM system that reaches RGB-D-level accuracy without a depth sensor. Its modular front end combines three pretrained components: monocular metric depth estimation, learned keypoint detection, and instance segmentation. Keypoints on static regions are assigned predicted metric depth, backprojected to 3D, and passed to an unmodified RGB-D back end for tracking and mapping in both static and dynamic indoor scenes. On the TUM RGB-D benchmark, the method reports state-of-the-art performance on dynamic sequences (1.8 cm ATE) and competitive results on static sequences (7.4 cm mean ATE), while running at about 22 FPS on a single GPU. The results suggest that modern pretrained vision models can replace active depth sensors, offering a simpler, more cost-effective path to robust, metric-scale SLAM with zero-shot deployment.

Abstract

We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
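
The front-end step described in the abstract (suppress keypoints on dilated dynamic masks, then backproject the remaining keypoints with predicted depth) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the square dilation element, the dilation radius, the nearest-pixel depth lookup, and the pinhole camera model with intrinsics K are all assumptions made for illustration.

```python
import numpy as np


def dilate_mask(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2r+1) x (2r+1) square element (assumed shape)."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # OR together every shifted copy of the mask inside the square window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def backproject_static_keypoints(
    keypoints_uv: np.ndarray,   # (N, 2) pixel coordinates, u = column, v = row
    depth: np.ndarray,          # (H, W) predicted metric depth in metres
    dynamic_mask: np.ndarray,   # (H, W) bool, True on dynamic instances
    K: np.ndarray,              # (3, 3) pinhole intrinsics (assumed camera model)
    dilation_radius: int = 2,   # hypothetical value, not taken from the paper
) -> np.ndarray:
    """Return (M, 3) camera-frame points for keypoints off the dilated mask."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    mask = dilate_mask(dynamic_mask, dilation_radius)

    # Nearest-pixel lookup; drop keypoints that fall outside the image.
    u = np.round(keypoints_uv[:, 0]).astype(int)
    v = np.round(keypoints_uv[:, 1]).astype(int)
    h, w = depth.shape
    in_bounds = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, kp = u[in_bounds], v[in_bounds], keypoints_uv[in_bounds]

    # Keep only static keypoints with a valid, positive depth prediction.
    z = depth[v, u]
    keep = (~mask[v, u]) & np.isfinite(z) & (z > 0)
    kp, z = kp[keep], z[keep]

    # Pinhole backprojection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (kp[:, 0] - cx) * z / fx
    y = (kp[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

In a pipeline like the one described, the resulting 3D points would take the place of sensor-derived depth features at the input of the RGB-D back end. Dilating the mask before filtering is a conservative choice: it also discards keypoints that sit on object boundaries, where segmentation and depth predictions tend to be least reliable.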

Paper Structure

This paper contains 23 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 2: Overview of DropD-SLAM. Each RGB frame is processed in parallel by three pretrained modules: (i) metric depth estimation, (ii) instance segmentation, and (iii) keypoint detection. Instance masks filter dynamic objects, and static keypoints are backprojected with predicted depth to form metrically scaled 3D features. These are passed to an unmodified RGB-D SLAM back end for tracking, mapping, and loop closure in real time.
  • Figure 3: Feature detection comparison on a dynamic frame. ORB (Rublee et al., 2011) features cluster in textured areas, whereas Key.Net (Barroso-Laguna et al., 2019) produces a more uniform distribution, improving robustness under low texture and motion.
  • Figure 4: Depth predictions from different models, individually rescaled for visualization. Differences highlight variations in relative structure and spatial consistency.
  • Figure 5: Depth error maps for different models. Red regions indicate overestimation and blue regions indicate underestimation relative to ground truth.