Deep Patch Visual Odometry

Zachary Teed, Lahav Lipson, Jia Deng

TL;DR

DPVO addresses the efficiency gap in deep monocular VO by replacing dense flow with sparse patch-based tracking. It introduces a recurrent update operator operating on a patch graph and a differentiable bundle adjustment layer to jointly refine patch depths and camera poses, trained end-to-end on synthetic data. The approach achieves state-of-the-art accuracy across multiple benchmarks while using substantially less memory and running at a constant, real-time frame rate (60–120 FPS). Its patch-based design yields robustness comparable to dense methods, with practical benefits for resource-constrained platforms and real-time SLAM-like applications.

Abstract

We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular Visual Odometry (VO). DPVO uses a novel recurrent network architecture designed for tracking image patches across time. Recent approaches to VO have significantly improved the state-of-the-art accuracy by using deep networks to predict dense flow between video frames. However, using dense flow incurs a large computational cost, making these previous methods impractical for many use cases. Despite this, it has been assumed that dense flow is important as it provides additional redundancy against incorrect matches. DPVO disproves this assumption, showing that it is possible to get the best accuracy and efficiency by exploiting the advantages of sparse patch-based matching over dense flow. DPVO introduces a novel recurrent update operator for patch-based correspondence coupled with differentiable bundle adjustment. On standard benchmarks, DPVO outperforms all prior work, including the learning-based state-of-the-art VO system (DROID), while using a third of the memory and running 3x faster on average. Code is available at https://github.com/princeton-vl/DPVO

Paper Structure (17 sections, 9 equations, 20 figures, 4 tables)

Figures (20)

  • Figure 1: Deep Patch Visual Odometry (DPVO). Camera poses and a sparse 3D reconstruction (top) are obtained by iterative 2D revisions of patch trajectories through time.
  • Figure 2: Feature and Patch Extraction (Sec. \ref{sec:patchextraction}). Residual networks extract 1) a context feature map at $\frac{1}{4}$ resolution and 2) a 2-level pyramid of matching features at $\frac{1}{4}$ and $\frac{1}{8}$ resolution. Many $p\times p$ patches are cropped from this feature map at random pixel coordinates using bilinear sampling.
  • Figure 3: The patch graph. Edges connect patches with frames. Multiple patches are extracted from each frame (e.g. blue) and are connected to nearby frames (green and purple).
  • Figure 4: Schematic of the update operator. Correlation features are extracted from edges in the patch graph and injected into the hidden state alongside context features. We apply 1D convolution, message passing and a transition block. The factor head produces trajectory revisions which are used by the bundle adjustment layer to update the camera poses and the depth of patches. Each "+" operation is a residual connection followed by layer normalization.
  • Figure 5: A subset of the patch trajectories predicted by our method. Patches extracted from the green keyframe are tracked through subsequent frames. When a new keyframe is added (blue), additional patches are extracted and tracked. Our method produces confidence values which weight their respective contribution to the bundle adjustment.
  • ...and 15 more figures
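
The captions for Figures 2 and 3 describe two steps that a short sketch can make concrete: cropping $p\times p$ patches from a feature map at random sub-pixel coordinates with bilinear sampling, and connecting each patch to nearby frames in a patch graph. The NumPy code below is an illustrative sketch only, not the authors' implementation. The function names (`bilinear_sample`, `crop_patches`, `build_patch_graph`), the default `p=3`, the `radius` parameter, and the choice to exclude a patch's own source frame from its edges are all assumptions made for this example.

```python
import numpy as np


def bilinear_sample(feat, xs, ys):
    """Bilinearly interpolate feat (C, H, W) at float coordinates xs, ys."""
    _, H, W = feat.shape
    # Top-left integer corner, clipped so that the +1 neighbour stays in bounds.
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    wx = xs - x0  # horizontal interpolation weight
    wy = ys - y0  # vertical interpolation weight
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x0 + 1] * wx
    bot = feat[:, y0 + 1, x0] * (1 - wx) + feat[:, y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy


def crop_patches(feat, num_patches, p=3, rng=None):
    """Crop num_patches p x p patches centred at random sub-pixel locations.

    Returns patches of shape (num_patches, C, p, p) and centres (num_patches, 2).
    The patch size p=3 is an assumed default for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, H, W = feat.shape
    r = (p - 1) / 2.0
    # Keep every patch fully inside the feature map.
    cx = rng.uniform(r, W - 1 - r, size=num_patches)
    cy = rng.uniform(r, H - 1 - r, size=num_patches)
    offsets = np.arange(p) - r
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    patches = np.stack(
        [bilinear_sample(feat, x + dx, y + dy) for x, y in zip(cx, cy)]
    )
    return patches, np.stack([cx, cy], axis=1)


def build_patch_graph(patch_frame, num_frames, radius):
    """List (patch index, frame index) edges to frames near each patch's source.

    patch_frame[k] is the frame patch k was extracted from. Excluding the
    source frame itself is an assumption of this sketch.
    """
    edges = [
        (k, j)
        for k, i in enumerate(patch_frame)
        for j in range(num_frames)
        if j != i and abs(i - j) <= radius
    ]
    return np.array(edges, dtype=int).reshape(-1, 2)
```

For example, `crop_patches(feat, 64)` on a feature map of shape `(C, H, W)` yields an array of shape `(64, C, 3, 3)`, and `build_patch_graph` turns the source-frame index of each patch into the edge list that an update operator like the one in Figure 4 would iterate over.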