Leveraging Consistent Spatio-Temporal Correspondence for Robust Visual Odometry

Zhaoxing Zhang, Junda Cheng, Gangwei Xu, Xiaoxiang Wang, Can Zhang, Xin Yang

TL;DR

This work tackles robustness and drift in visual odometry by leveraging spatio-temporal cues to improve multi-frame flow matching. It introduces STVO, a deep architecture with a Temporal Propagation Module and a Spatial Activation Module that mutually reinforce temporal and spatial consistency, integrated with a differentiable bundle adjustment backend. The approach achieves state-of-the-art results on TUM-RGBD, EuRoC MAV, ETH3D, and KITTI Odometry, including substantial improvements on ETH3D (77.8%) and KITTI (38.9%) over prior best methods. The findings highlight the importance of exploiting both spatial and temporal coherence in multi-frame flow estimation for robust, low-drift VO in challenging environments and long sequences.

Abstract

Recent approaches to visual odometry (VO) have significantly improved performance by using deep networks to predict optical flow between video frames. However, existing methods still suffer from noisy and inconsistent flow matching, making it difficult to handle challenging scenarios and long-sequence estimation. To overcome these challenges, we introduce Spatio-Temporal Visual Odometry (STVO), a novel deep network architecture that effectively leverages inherent spatio-temporal cues to enhance the accuracy and consistency of multi-frame flow matching. With more accurate and consistent flow matching, STVO can achieve better pose estimation through bundle adjustment (BA). Specifically, STVO introduces two innovative components: 1) the Temporal Propagation Module, which uses multi-frame information to extract and propagate temporal cues across adjacent frames, maintaining temporal consistency; 2) the Spatial Activation Module, which uses geometric priors from the depth maps to enhance spatial consistency while filtering out excessive noise and incorrect matches. Our STVO achieves state-of-the-art performance on the TUM-RGBD, EuRoC MAV, ETH3D, and KITTI Odometry benchmarks. Notably, it improves accuracy by 77.8% on the ETH3D benchmark and 38.9% on the KITTI Odometry benchmark over the previous best methods.
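
To make the reported percentages concrete, the sketch below shows one common way a relative accuracy gain could be computed from trajectory errors. It assumes the metric is an absolute trajectory error (ATE) reported as RMSE over positions that are already aligned to ground truth. The abstract does not name the metric, so this is an assumption, and the function names `ate_rmse` and `relative_improvement` are hypothetical. The numbers are toy values, not results from the paper.

```python
import math


def ate_rmse(est, gt):
    """RMSE of position error between two aligned trajectories.

    est, gt: equal-length sequences of (x, y, z) positions.
    Alignment (e.g. a rigid or similarity fit) is assumed to be done already.
    """
    if len(est) != len(gt) or len(est) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [
        sum((e - g) ** 2 for e, g in zip(p, q))
        for p, q in zip(est, gt)
    ]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_error, new_error):
    """Error reduction relative to the baseline, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy example: every estimated position is off by 0.1 m.
gt = [(0.0, 0.0, 0.0)] * 4
est = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1), (0.1, 0.0, 0.0)]
error = ate_rmse(est, gt)
print(relative_improvement(0.5, error))
```

Under this reading, a 77.8% improvement would mean the new error is roughly 22.2% of the previous best error on the same benchmark.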

Paper Structure

This paper contains 28 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison of STVO with other influential visual odometry methods. Our STVO, denoted by red stars, achieves state-of-the-art performance in all benchmarks.
  • Figure 2: Diagram of temporal consistency and spatial consistency across multiple frames.
  • Figure 3: Overview of STVO. The architecture consists of three key modules: 1) Temporal Propagation Module, which enhances temporal consistency; 2) Spatial Activation Module, which maintains spatial consistency and filters out incorrect matches; 3) Differentiable Bundle Adjustment (DBA) Module, which updates poses and depths using optical flow estimates. The dashed lines indicate that the depth map input can be either the depth generated by Depth Anything V2 or the depth map output by bundle adjustment. Both input options have proven effective.
  • Figure 4: Diagram of Temporal Propagation Module.
  • Figure 5: Diagram of Spatial Activation Module.
  • ...and 2 more figures
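
The Figure 3 caption says the DBA module updates poses and depths using optical flow estimates. A minimal sketch of the geometry such a step typically relies on is given below: the flow that a pixel's depth and a relative camera pose would induce under a pinhole model, and its residual against a predicted flow. This is a generic illustration, not the authors' implementation. The function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the per-pixel formulation are all assumptions, and the solver that minimizes the residual is not shown.

```python
def induced_flow(u, v, depth, intrinsics, R, t):
    """Flow of pixel (u, v) in frame i implied by its depth and the pose i -> j.

    intrinsics: (fx, fy, cx, cy) of a pinhole camera (assumed shared by both frames).
    R: 3x3 rotation as nested lists, t: translation (tx, ty, tz), mapping
       points from frame i coordinates to frame j coordinates.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in frame i.
    X = (u - cx) / fx * depth
    Y = (v - cy) / fy * depth
    Z = depth
    # Move the point into frame j.
    Xj = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
    Yj = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
    Zj = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
    if Zj <= 0:
        raise ValueError("point is behind the camera in frame j")
    # Project into frame j and subtract the source pixel.
    uj = fx * Xj / Zj + cx
    vj = fy * Yj / Zj + cy
    return uj - u, vj - v


def flow_residual(pred_flow, u, v, depth, intrinsics, R, t):
    """Difference between a predicted flow and the geometry-induced flow."""
    du, dv = induced_flow(u, v, depth, intrinsics, R, t)
    return pred_flow[0] - du, pred_flow[1] - dv


# Toy example: identity rotation, 0.1 m shift along x, point 2 m away.
K = (500.0, 500.0, 320.0, 240.0)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(induced_flow(320.0, 240.0, 2.0, K, I3, (0.1, 0.0, 0.0)))
```

In a bundle adjustment step, residuals of this kind would be collected over many pixels and frame pairs and driven toward zero by adjusting the poses and depths, which is why noisy or inconsistent flow predictions translate directly into pose error.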