Salient Sparse Visual Odometry With Pose-Only Supervision

Siyu Chen, Kangcheng Liu, Chen Wang, Shenghai Yuan, Jianfei Yang, Lihua Xie

TL;DR

The paper tackles robust visual odometry under challenging lighting and motion conditions while reducing labeling burden. It introduces a pose-only supervised hybrid VO that bootstraps optical-flow learning through self-supervised homography pre-training and employs a salient patch-based sparse flow estimator paired with a weighted bundle adjustment layer. Key contributions include the salient patches strategy, the homography-based pre-training, and the patch refinement module, with strong generalization demonstrated across TartanAir, EuRoC, TUM, and OIVIO, plus a real-world robustness test. The proposed approach achieves competitive accuracy and superior robustness in unseen scenarios, offering a practical solution for autonomous systems requiring reliable VO without dense optical-flow supervision.

Abstract

Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision, offering a balanced solution between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training for enhancing optical flow learning from pose-only labels and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels for training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.

Paper Structure

This paper contains 19 sections, 11 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Structure of our proposed method. Our method employs a CNN to extract features and patches from the salient patches extraction module. These patches are reprojected to neighboring frames by using the estimated poses and depth, and the correlation map is computed with the neighbor features of the reprojected positions. The correlation map, along with patch context information, is fed into the flow estimation network to get the optical flow and confidence weights. Then, the weighted bundle adjustment layer is applied to get poses and patch depths. This sequence—reprojection, correlation map computation, flow estimation, and bundle adjustment—is iterated N times to get the final poses and depth.
  • Figure 2: Illustration of the self-supervised training process. Green triangles and circles denote the salient patches and the randomly selected patches, respectively. Yellow squares denote the estimated flow. Corresponding points obtained by homographic adaptation serve as the ground truth for both flow training and feature training.
  • Figure 3: Comparison of patch selection strategies. Green squares mark the selected patches. The first row shows the random patch selection strategy of DPVO, and the second row shows the salient patch selection strategy of our method. Our method provides more meaningful and more evenly distributed patches than random selection.
  • Figure 4: Visualizations of patch tracking, the corresponding confidence weights, and the correlation map. The first row displays tracked patches alongside their confidence weights (bluer tones signify higher scores, redder tones lower scores). Rows two through four depict the attention map, with darker shades representing lower attention scores and lighter shades representing higher scores. The magnified content of the square-boxed area is shown in the top-left corner.
  • Figure 5: A real-world demo in a meeting room with significant illumination changes, used to compare the generalization ability of different methods. The approximate trajectory consists of two loops around the table along a nearly identical path, so the start and the end of the path are roughly aligned. Because the sequence has no ground truth, we present the trajectory in four images and assess performance by how well the initial and final phases of the trajectory overlap.
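
The Figure 2 caption says that corresponding points obtained by homographic adaptation serve as ground truth for flow training. The NumPy sketch below illustrates that idea only: it samples a homography, warps patch centers through it, and takes the displacement as the flow label. It is not the authors' implementation. The function names, the corner-jitter sampling scheme, and the `max_shift` parameter are all assumptions made for illustration, and the direct linear transform is left unnormalized for brevity.

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct linear transform: find H such that dst ~ H @ src (4+ point pairs)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def random_homography(height, width, max_shift=0.15, rng=None):
    """Hypothetical sampler: jitter the four image corners by up to max_shift
    of the image size and fit a homography to the result."""
    rng = np.random.default_rng() if rng is None else rng
    src = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float64)
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * np.array([width, height])
    return homography_from_points(src, src + jitter)

def warp_points(H, pts):
    """Apply H to an (N, 2) array of pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

def flow_ground_truth(H, patch_centers):
    """Flow label for each patch center: warped position minus original position."""
    return warp_points(H, patch_centers) - patch_centers

# Usage: flow labels for 96 patch centers in a 640x480 image.
rng = np.random.default_rng(0)
H = random_homography(480, 640, rng=rng)
centers = rng.uniform([0, 0], [639, 479], size=(96, 2))
flow_gt = flow_ground_truth(H, centers)  # array of shape (96, 2)
```

Because the warp is known exactly, every patch center gets an exact flow label without any optical-flow annotation. This matches the role the caption assigns to homographic adaptation, although the actual training pipeline may sample and apply the homography differently.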