
DINO-VO: Learning Where to Focus for Enhanced State Estimation

Qi Chen, Guanghao Li, Sijia Hu, Xin Gao, Junpeng Ma, Xiangyang Xue, Jian Pu

Abstract

We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, EuRoC, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.

Paper Structure

This paper contains 27 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of the system. The system comprises three modules, shown from left to right: the Multi-task Feature Extractor, the Adaptive Patch Selector, and the Sparse Bundle Adjustment Layer. The Multi-task Feature Extractor extracts the features used for matching, patch selection, and bundle adjustment. The Adaptive Patch Selector selects high-weight patch features for bundle adjustment. The Sparse Bundle Adjustment Layer performs bundle adjustment to optimize the pose in the factor graph.
  • Figure 2: Architecture of the Multi-task Feature Extractor. The module predicts context features, matching features, an inverse depth map, and a prior weight map from a single RGB image.
  • Figure 3: Pipeline of the Adaptive Patch Selector. We use the prior weight map and the depth map to uniformly select high-weight patches, which are more useful for the subsequent sparse bundle adjustment.
  • Figure 4: Comparison of our adaptive patch selector with existing systems. The first row shows input images from different datasets. The second row shows the random patch selection strategy used in prior work [dpvo, dpslam], while the third row shows our patch selection strategy, which focuses on areas that contribute significantly to bundle adjustment.
  • Figure 5: Comparison of reconstruction results on TartanAir [tartanair]. Our sparse map is more informative than DPVO's, particularly in texture-rich areas, as shown in the three small images on the right.
  • ...and 2 more figures
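
The Figure 3 caption describes selecting high-weight patches uniformly from a prior weight map and a depth map. The sketch below illustrates one plausible reading of that idea and is not the authors' implementation: it splits the image into a grid, keeps the highest-weight pixel in each cell so that the candidates are spread across the image, and then retains the strongest candidates. The function name `select_patches`, the `grid` and `num_patches` parameters, and the rule that pixels without a positive inverse depth are ineligible are all assumptions made for illustration.

```python
import numpy as np

def select_patches(weight, inv_depth, grid=(4, 4), num_patches=8):
    """Return up to num_patches (row, col) patch centres, at most one per grid cell."""
    h, w = weight.shape
    # Assumption: a pixel is eligible only if its inverse depth prior is positive.
    masked = np.where(inv_depth > 0, weight, -np.inf)
    # Cell boundaries along each axis.
    rows = np.linspace(0, h, grid[0] + 1, dtype=int)
    cols = np.linspace(0, w, grid[1] + 1, dtype=int)
    candidates = []
    for r0, r1 in zip(rows[:-1], rows[1:]):
        for c0, c1 in zip(cols[:-1], cols[1:]):
            cell = masked[r0:r1, c0:c1]
            # Skip empty cells and cells with no eligible pixel.
            if cell.size == 0 or not np.isfinite(cell.max()):
                continue
            # Best pixel inside this cell, converted to image coordinates.
            dy, dx = np.unravel_index(np.argmax(cell), cell.shape)
            candidates.append((float(cell[dy, dx]), int(r0 + dy), int(c0 + dx)))
    # Keep the highest-weight candidates across all cells.
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [(y, x) for _, y, x in candidates[:num_patches]]
```

Taking one candidate per cell trades some total weight for spatial coverage: two strong responses that fall in the same cell yield a single patch, which keeps the selection from clustering in one textured region. A learned, differentiable selector as described in the abstract would replace the hard `argmax` with a trainable scoring step; this sketch only shows the selection geometry.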