MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing
Shuo Wang, Wanting Li, Yongcai Wang, Zhaoxin Fan, Zhe Huang, Xudong Cai, Jian Zhao, Deying Li
TL;DR
MambaVO tackles three core weaknesses of deep visual odometry (unstable initialization, coarse inter-frame matching, and gradient variance during optimization) by introducing a Point-Frame Graph (PFG) and three components: a Geometric Initialization Module (GIM) for robust pose initialization, a Geometric Mamba Module (GMM) for sequential matching refinement, and a Trending-Aware Penalty (TAP) for training stability. MambaVO++ extends the approach with loop closure for global SLAM optimization. The authors report state-of-the-art accuracy on EuRoC, TUM-RGBD, KITTI, and TartanAir while maintaining real-time performance and lower memory usage. The method combines semi-dense matching, multi-frame history fusion via Mamba blocks, and differentiable bundle adjustment (BA) that optimizes both poses and map points, with a GRU-based history update and a refinement head that produces per-edge pixel updates and weights. Overall, MambaVO delivers robust, efficient learning-to-optimize visual odometry that extends to full SLAM in diverse, challenging environments.
Abstract
Deep visual odometry has advanced considerably through learning-to-optimize techniques. This approach relies heavily on visual matching across frames. However, ambiguous matching in challenging scenarios leads to significant errors in geometric modeling and bundle adjustment (BA) optimization, which undermines the accuracy and robustness of pose estimation. To address this challenge, this paper proposes MambaVO, which performs robust initialization, Mamba-based sequential matching refinement, and smoothed training to enhance matching quality and improve pose estimation. Specifically, each new frame is matched with the closest keyframe in the maintained Point-Frame Graph (PFG) via the Geometric Initialization Module (GIM), which is based on semi-dense matching. The initialized PFG is then processed by the proposed Geometric Mamba Module (GMM), which exploits the matching features to refine the overall inter-frame matching. The refined PFG is finally processed by differentiable BA to optimize the poses and the map. To deal with gradient variance, a Trending-Aware Penalty (TAP) is proposed to smooth training and improve convergence and stability. Finally, a loop closure module is added to form MambaVO++. On public benchmarks, MambaVO and MambaVO++ achieve state-of-the-art (SOTA) performance while running in real time.
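
To make the PFG and the BA step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the graph can be represented as map points, frame poses, and point-to-frame edges, where each edge carries a matched pixel, a refinement offset, and a confidence weight. The class names, field names, pinhole camera model, and the weighted squared reprojection cost are all illustrative assumptions; the abstract does not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


@dataclass
class Frame:
    # Hypothetical pose layout: world-to-camera rotation (3x3, row-major)
    # and translation. The paper may parameterize poses differently.
    R: List[List[float]]
    t: Vec3


@dataclass
class Edge:
    point_id: int
    frame_id: int
    target: Vec2               # matched pixel location in the frame
    delta: Vec2 = (0.0, 0.0)   # assumed per-edge pixel update from refinement
    weight: float = 1.0        # assumed per-edge confidence weight


@dataclass
class PointFrameGraph:
    points: Dict[int, Vec3] = field(default_factory=dict)
    frames: Dict[int, Frame] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)


def project(point: Vec3, frame: Frame,
            fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Project a 3D world point into a frame with a pinhole camera model."""
    cam = [sum(frame.R[i][j] * point[j] for j in range(3)) + frame.t[i]
           for i in range(3)]
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def weighted_reprojection_cost(graph: PointFrameGraph,
                               fx: float, fy: float,
                               cx: float, cy: float) -> float:
    """Sum of weighted squared pixel residuals over all edges.

    Each residual compares the refined target (match plus offset) with the
    projection of the map point under the current pose estimate.
    """
    cost = 0.0
    for e in graph.edges:
        u, v = project(graph.points[e.point_id], graph.frames[e.frame_id],
                       fx, fy, cx, cy)
        ru = e.target[0] + e.delta[0] - u
        rv = e.target[1] + e.delta[1] - v
        cost += e.weight * (ru * ru + rv * rv)
    return cost
```

In this sketch, a BA solver would adjust the entries of `frames` and `points` to reduce the cost, while a refinement module would supply `delta` and `weight` for each edge. How MambaVO actually computes those quantities is not described in the abstract.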
