Deep Patch Visual SLAM
Lahav Lipson, Zachary Teed, Jia Deng
TL;DR
DPV-SLAM addresses the need for robust monocular visual SLAM that runs on a single GPU with stable real-time performance. It extends the DPVO frontend with two loop-closure mechanisms, one proximity-based and one based on classical image retrieval, and adds a CUDA-accelerated block-sparse bundle adjustment so that odometry and loop-closure optimization share a single patch-graph representation. The method achieves competitive accuracy compared to state-of-the-art deep SLAM on EuRoC, KITTI, and TartanAir, while running 2.5x faster than DROID-SLAM with a fraction of the memory. The result is a memory-efficient, real-time system for monocular video that generalizes across domains, maintains global consistency, and ships with open-source code.
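To make the idea of proximity-based loop closure concrete, the following is a minimal sketch of candidate detection from estimated camera positions. It is an illustration under stated assumptions, not the DPV-SLAM implementation: the function name `proximity_loop_candidates`, the `radius` and `min_gap` parameters, and the use of plain Euclidean distance between camera centres are all hypothetical choices made for this example. Because monocular trajectories have no absolute scale, `radius` is expressed in the trajectory's own arbitrary units.

```python
import numpy as np


def proximity_loop_candidates(positions, radius=1.0, min_gap=30):
    """Return (i, j) frame pairs with i < j that are loop-closure candidates.

    A pair qualifies when the two estimated camera centres lie within
    `radius` of each other and the frames are at least `min_gap` indices
    apart, so that temporally adjacent frames are not reported.
    All names and thresholds here are illustrative assumptions.
    """
    positions = np.asarray(positions, dtype=float)  # shape (N, 3)

    # Pairwise Euclidean distances between camera centres, shape (N, N).
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # gap[i, j] = j - i; requiring gap >= min_gap also enforces i < j.
    idx = np.arange(len(positions))
    gap = idx[None, :] - idx[:, None]

    mask = (dist < radius) & (gap >= min_gap)
    i, j = np.nonzero(mask)
    return list(zip(i.tolist(), j.tolist()))
```

For a toy trajectory that walks around a rectangle and returns near its start, `proximity_loop_candidates([[0,0,0],[1,0,0],[2,0,0],[2,1,0],[1,1,0],[0,1,0],[0,0.1,0]], radius=0.5, min_gap=3)` returns `[(0, 6)]`, pairing the first and last frames. A real system would then verify such candidates and add the corresponding constraints to the optimization; the dense pairwise distance matrix used here is quadratic in the number of frames and is chosen only for clarity.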
Abstract
Recent work in visual SLAM has shown the effectiveness of using deep network backbones. Despite excellent accuracy, however, such approaches are often expensive to run or do not generalize well zero-shot. Their runtime can also fluctuate wildly while their frontend and backend fight for access to GPU resources. To address these problems, we introduce Deep Patch Visual (DPV) SLAM, a method for monocular visual SLAM on a single GPU. DPV-SLAM maintains a high minimum framerate and small memory overhead (5-7 GB) compared to existing deep SLAM systems. On real-world datasets, DPV-SLAM runs at 1x-4x real-time framerates. We achieve comparable accuracy to DROID-SLAM on EuRoC and TartanAir while running 2.5x faster and using a fraction of the memory. DPV-SLAM is an extension to the DPVO visual odometry system; its code can be found in the same repository: https://github.com/princeton-vl/DPVO
