DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras
Zachary Teed, Jia Deng
TL;DR
DROID-SLAM is a deep learning-based SLAM system that jointly estimates camera poses and dense per-pixel depth. Building on RAFT, it maintains a frame graph and applies a ConvGRU-based update operator whose outputs pass through a differentiable Dense Bundle Adjustment (DBA) layer, which iteratively refines poses and depths toward an accurate, globally consistent trajectory. Although it is trained only on monocular video, the same network can use stereo or RGB-D input at test time without retraining. It reports state-of-the-art accuracy on monocular, stereo, and RGB-D benchmarks with substantially fewer catastrophic failures than prior work, and it runs in real time on high-end GPUs.
Abstract
We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures. Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time. The URL to our open source code is https://github.com/princeton-vl/DROID-SLAM.
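To make the idea of a joint pose-and-depth update more concrete, the following is a minimal sketch of one linear solve of the kind a bundle adjustment step typically performs. It is not the authors' implementation, and the abstract does not state how the layer solves for its updates. It assumes a Gauss-Newton system whose depth block is diagonal (one unknown per pixel), which lets the depths be eliminated with a Schur complement. The function name `solve_schur`, the symbols `B`, `E`, `c_diag`, `v`, `w`, and the toy sizes (6 pose parameters, 50 depths) are all illustrative assumptions.

```python
import numpy as np


def solve_schur(B, E, c_diag, v, w):
    """Solve [[B, E], [E.T, diag(c_diag)]] @ [dx; dd] = [v; w].

    B      : (p, p) pose block of the approximate Hessian
    E      : (p, n) pose-depth coupling block
    c_diag : (n,)   diagonal of the depth block (assumed positive)
    v, w   : right-hand sides for the pose and depth unknowns
    """
    c_inv = 1.0 / c_diag
    # Eliminate the depths: (B - E C^-1 E^T) dx = v - E C^-1 w
    S = B - (E * c_inv) @ E.T
    dx = np.linalg.solve(S, v - E @ (c_inv * w))
    # Back-substitute for the per-pixel depth updates.
    dd = c_inv * (w - E.T @ dx)
    return dx, dd


# Toy problem with made-up numbers (hypothetical sizes, fixed seed).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
B = 10.0 * np.eye(6) + A @ A.T            # well-conditioned pose block
E = 0.1 * rng.standard_normal((6, 50))    # weak pose-depth coupling
c_diag = 1.0 + rng.random(50)             # positive diagonal depth block
v = rng.standard_normal(6)
w = rng.standard_normal(50)

dx, dd = solve_schur(B, E, c_diag, v, w)
print("pose update:", dx)
print("first depth updates:", dd[:3])
```

Because the depth block is diagonal, inverting it is an elementwise division, so the only dense solve is over the small pose block. In an iterative system, a solve of this kind would be repeated each time the residuals are re-estimated.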
