Deep Patch Visual SLAM

Lahav Lipson, Zachary Teed, Jia Deng

TL;DR

DPV-SLAM addresses the need for robust monocular visual SLAM that runs on a single GPU with stable real-time performance. It extends the DPVO frontend with two loop-closure mechanisms—proximity-based and classical image-retrieval-based—and a CUDA-accelerated block-sparse bundle adjustment to unify odometry and loop-closure optimization within a patch-graph representation. The method achieves competitive accuracy compared to state-of-the-art deep SLAM on EuRoC, KITTI, and TartanAir, while delivering 2–4x speedups and reduced memory usage relative to prior deep SLAM systems. This approach broadens practical deployment of deep SLAM by providing a memory-efficient, real-time, domain-robust solution for monocular video with global consistency and open-source code.

Abstract

Recent work in visual SLAM has shown the effectiveness of using deep network backbones. Despite excellent accuracy, however, such approaches are often expensive to run or do not generalize well zero-shot. Their runtime can also fluctuate wildly while their frontend and backend fight for access to GPU resources. To address these problems, we introduce Deep Patch Visual (DPV) SLAM, a method for monocular visual SLAM on a single GPU. DPV-SLAM maintains a high minimum framerate and small memory overhead (5-7G) compared to existing deep SLAM systems. On real-world datasets, DPV-SLAM runs at 1x-4x real-time framerates. We achieve comparable accuracy to DROID-SLAM on EuRoC and TartanAir while running 2.5x faster using a fraction of the memory. DPV-SLAM is an extension to the DPVO visual odometry system; its code can be found in the same repository: https://github.com/princeton-vl/DPVO

Paper Structure

This paper contains 10 sections, 7 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Overview of DPV-SLAM. Our system is based on the odometry system from DPVO (Teed et al., 2023), and introduces two efficient loop closure mechanisms to correct for accumulated drift. Like DPVO, our system utilizes a patch graph scene representation, and alternates between predicting sparse optical flow residuals and optimizing the camera poses and depth using bundle adjustment. Our proximity loop closure detects loops using the pre-estimated geometry and jointly refines all variables together. Our classical loop closure uses image retrieval and pose graph optimization.
  • Figure 2: Each directed edge in the patch graph introduces a highly asymmetric memory overhead for the source patch and destination frame, respectively. Each edge is also associated with its own reprojection-error factor in the optimization which constrains both camera poses. This means we can flip their direction arbitrarily to influence which frames incur the memory cost, without significantly impacting the optimization result. This is a unique property of DPVO's patch representation and motivates the use of uni-directional edges; prior works used bi-directional (and dense) edges in their backends.
  • Figure 3: The patch graph for DPV-SLAM. We introduce directed edges from old frames to recent frames still in use by the odometry component. These edges are chosen based on the camera's proximity to previously visited locations. The construction of this patch graph only requires storing a finite number of dense feature maps, keeping the overall memory consumption small.
  • Figure 4: We visualize the number of patches participating in the optimization over the course of a video. During invocations of our proximity loop-closure, we perform global bundle adjustment which updates a significant portion of patch depths. Here, we only consider patches with at least one high-confidence outgoing edge $(w > 0.5)$.
  • Figure 5: Drift estimation. After identifying a candidate image pair for loop closure using image retrieval, we seek to estimate the accumulated drift as a relative 7DOF transformation. Using off-the-shelf detectors and matchers, we estimate 2D correspondences from each retrieved image to its two temporal neighbors and perform structure-only bundle adjustment to triangulate their depth. Finally, we match between the resulting 3D keypoints and estimate a 7DOF point-cloud alignment with RANSAC+Umeyama (Umeyama, 1991).
  • ...and 5 more figures
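
The last step described in the Figure 5 caption, a 7DOF (scale, rotation, translation) point-cloud alignment with RANSAC+Umeyama, can be illustrated with a small NumPy sketch. This is not the DPV-SLAM implementation: the function names `umeyama` and `ransac_umeyama`, the minimal sample size of 3, the iteration count, and the inlier threshold are all assumptions chosen for illustration, and the inputs are assumed to be already-matched 3D keypoints.

```python
import numpy as np


def umeyama(src, dst):
    """Closed-form least-squares similarity so that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of matched 3D points (hypothetical inputs).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / n                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid returning a reflection
    R = U @ S @ Vt
    var_s = (src_c ** 2).sum() / n            # variance of the source points
    s = (D * np.diag(S)).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t


def ransac_umeyama(src, dst, iters=200, thresh=0.05, seed=0):
    """Robust alignment: fit on random 3-point samples, keep the largest
    consensus set, then refit on all of its inliers.
    iters and thresh are illustrative values, not taken from the paper."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = umeyama(src[idx], dst[idx])
        err = np.linalg.norm(dst - (s * src @ R.T + t), axis=1)
        inliers = err < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    s, R, t = umeyama(src[best], dst[best])
    return s, R, t, best
```

The recovered scale is what makes the transform 7DOF rather than 6DOF: monocular trajectories drift in scale as well as in rotation and translation, so a rigid alignment alone could not express the accumulated drift.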