DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

Zachary Teed, Jia Deng

TL;DR

DROID-SLAM is a deep learning-based SLAM system that jointly estimates camera motion and dense depth by iteratively refining poses and depths through a differentiable Dense Bundle Adjustment layer. Building on RAFT, it uses a frame graph, dense per-pixel depth, and a ConvGRU-based update operator to converge to accurate, globally consistent trajectories, even when trained only on monocular data. The approach reports state-of-the-art results across monocular, stereo, and RGB-D benchmarks, with far fewer catastrophic failures than prior work, and it handles stereo and RGB-D input at test time without retraining. It runs in real time on powerful GPUs.

Abstract

We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures. Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time. The URL to our open source code is https://github.com/princeton-vl/DROID-SLAM.
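
The abstract describes joint updates of camera pose and pixelwise depth. The geometric link between the two is reprojection: given a pixel's depth and the relative pose between two frames, one can compute where that pixel should land in the second frame. The following is a minimal sketch of that step, assuming a pinhole camera and an inverse-depth parameterization. It is not the authors' implementation; the function name `reproject` and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np

def reproject(pixels, inv_depth, R, t, K):
    """Map pixels from frame i into frame j (illustrative helper, not from the paper's code).

    pixels:    (N, 2) array of (u, v) coordinates in frame i
    inv_depth: (N,) inverse depths for those pixels (assumed > 0)
    R, t:      rotation (3, 3) and translation (3,) taking frame-i
               camera coordinates to frame-j camera coordinates
    K:         (3, 3) pinhole intrinsics (assumed camera model)
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    depth = 1.0 / inv_depth
    # Back-project each pixel to a 3D point in frame i.
    x = (pixels[:, 0] - cx) / fx * depth
    y = (pixels[:, 1] - cy) / fy * depth
    points_i = np.stack([x, y, depth], axis=1)
    # Apply the rigid transform into frame j.
    points_j = points_i @ R.T + t
    # Project back onto the image plane of frame j.
    u = fx * points_j[:, 0] / points_j[:, 2] + cx
    v = fy * points_j[:, 1] / points_j[:, 2] + cy
    return np.stack([u, v], axis=1)
```

In a bundle-adjustment setting, the difference between such reprojected coordinates and the correspondences predicted from image data forms the residual that pose and depth updates try to reduce.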

Paper Structure

This paper contains 13 sections, 14 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: DROID-SLAM can operate on monocular, stereo, and RGB-D video. It builds a dense 3D map of the environment while simultaneously localizing the camera within the map.
  • Figure 2: Illustration of the update operator. The operator acts on edges in the frame graph, predicting flow revisions which are mapped to depth and pose updates through the Dense Bundle Adjustment (DBA) layer.
  • Figure 3: DROID-SLAM can generalize to new datasets. In order, we show results from Tanks & Temples, ScanNet, Sintel, and ETH-3D, all using monocular video.
  • Figure 4: Generalization results on the RGB-D ETH3D-SLAM benchmark. (Left) Our method, which is trained only on the synthetic TartanAir dataset, ranks 1st on both the train and test splits. (Right) Plot of the number of successful trajectories as a function of ATE. Our method successfully tracks 30/32 of the datasets where image data is available.
  • Figure 5: Visualizations of keyframe image, depth, flow and confidence estimates.
  • ...and 4 more figures