MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing

Shuo Wang; Wanting Li; Yongcai Wang; Zhaoxin Fan; Zhe Huang; Xudong Cai; Jian Zhao; Deying Li

TL;DR

MambaVO targets three weaknesses of deep visual odometry: unstable initialization, coarse inter-frame matching, and gradient variance during training. It maintains a Point-Frame Graph (PFG) and adds three components: a Geometric Initialization Module (GIM) for robust pose initialization, a Geometric Mamba Module (GMM) for sequential matching refinement, and a Trending-Aware Penalty (TAP) for training stability. MambaVO++ extends the system with loop closure for global SLAM optimization. The method combines semi-dense matching, multi-frame history fusion via Mamba blocks with a GRU-based history update, and a refinement head that outputs per-edge pixel updates and weights; differentiable BA then optimizes both poses and map points. The authors report state-of-the-art accuracy on EuRoC, TUM-RGBD, KITTI, and TartanAir with real-time performance and lower memory usage.

Abstract

Deep visual odometry has advanced considerably through learning-to-optimize techniques. This approach relies heavily on visual matching across frames. However, ambiguous matching in challenging scenarios leads to significant errors in geometric modeling and bundle adjustment optimization, which undermines the accuracy and robustness of pose estimation. To address this challenge, this paper proposes MambaVO, which performs robust initialization, Mamba-based sequential matching refinement, and smoothed training to enhance matching quality and improve pose estimation. Specifically, each new frame is matched with the closest keyframe in the maintained Point-Frame Graph (PFG) via the semi-dense Geometric Initialization Module (GIM). The initialized PFG is then processed by the proposed Geometric Mamba Module (GMM), which exploits the matching features to refine the overall inter-frame matching. The refined PFG is finally processed by differentiable BA to optimize the poses and the map. To reduce gradient variance, a Trending-Aware Penalty (TAP) is proposed to smooth training and improve convergence and stability. Finally, a loop closure module is added to form MambaVO++. On public benchmarks, MambaVO and MambaVO++ demonstrate SOTA performance while running in real time.
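To make the last stage of the pipeline concrete, the sketch below evaluates a weighted reprojection cost of the kind a bundle adjustment layer minimizes once each match has a pixel update and a weight. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`project`, `weighted_reprojection_cost`), the pinhole camera model, and the array shapes are all hypothetical, and a real differentiable BA layer would also solve for the poses and map points rather than only evaluate the cost.

```python
import numpy as np

def project(K, R, t, X):
    """Project world points X (N, 3) into a pinhole camera.

    K is the 3x3 intrinsic matrix; (R, t) maps world to camera coordinates.
    Returns pixel coordinates of shape (N, 2).
    """
    Xc = X @ R.T + t              # world -> camera frame
    uvw = Xc @ K.T                # apply intrinsics (homogeneous pixels)
    return uvw[:, :2] / uvw[:, 2:3]

def weighted_reprojection_cost(K, R, t, X, matches, delta, weights):
    """Sum of weighted squared reprojection residuals over all edges.

    matches: (N, 2) initial match locations in the target frame (assumed input)
    delta:   (N, 2) per-edge pixel updates (assumed output of a refinement head)
    weights: (N, 2) per-edge confidence weights (assumed output of the same head)
    """
    target = matches + delta                 # refined match locations
    residual = project(K, R, t, X) - target  # (N, 2) pixel residuals
    return float(np.sum(weights * residual ** 2))
```

With a correct pose and zero pixel updates the cost is zero; edges with low weight contribute little, which is how ambiguous matches can be down-weighted before optimization.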

Paper Structure (25 sections, 14 equations, 6 figures, 7 tables)

Introduction
Related Work
Direct Pose Regression for Visual Odometry
Learning to Optimize for Visual Odometry
State Space Models
Methods
Geometric Initialization Module: GIM
Matching and Pose Initialization
Matching Feature Preparation
Geometric Mamba Module: GMM
History Fusion
Geometric Mamba Blocks
Smoothed Training
Training Loss
Trending-Aware Penalty
...and 10 more sections

Figures (6)

Figure 1: The proposed MambaVO extracts DINOv2 (Oquab et al., 2023) features from the input RGB sequence and estimates the depth (Hu et al., 2024) for keyframes. In the Geometric Initialization Module (GIM), a semi-dense matching network is used to generate initial matches, estimate the initial poses, and extract features for each match. Next, the Geometric Mamba Module (GMM) refines and re-weights the matches. Finally, a differentiable bundle adjustment (BA) optimizes the final poses, ensuring accuracy and stability in the pose estimation process.
Figure 2: Illustration of Geometric Initialization Module. GIM extracts geometric and context features, and performs matching using a semi-dense matching network. The initial pose is estimated using a PnP solver based on the matched pixels.
Figure 3: Illustration of Geometric Mamba Module. The GMM refines the matching by incorporating both current and historical information. Matching tokens $\mathbf M_t$ are derived from matching features and previous matching tokens, then processed through Mamba blocks. The RefineHead then decodes the refinement and the weight for each match.
Figure 4: Qualitative visualization. The blue line represents the trajectory estimated by MambaVO, and the red line represents the ground truth. The results show that our estimated trajectory almost completely coincides with the ground truth.
Figure 5: We report the average ATE on the validation split of TartanAir. We observe that our loss design strategy makes training converge faster and reach a smaller ATE.
...and 1 more figures
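
The captions for Figures 4 and 5 compare estimated trajectories with ground truth using the absolute trajectory error (ATE). The sketch below shows one common way to compute an ATE RMSE: rigidly align the estimated positions to the ground truth with the Kabsch method, then take the root mean square of the remaining position errors. It is an illustrative assumption rather than the paper's evaluation code; the function names are hypothetical, and monocular benchmarks often also estimate a scale factor during alignment, which this sketch omits.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation R and translation t with R @ est_i + t ~= gt_i.

    est, gt: (N, 3) arrays of corresponding positions (Kabsch method, no scale).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """RMSE of position errors after rigid alignment of est onto gt."""
    R, t = align_rigid(est, gt)
    err = est @ R.T + t - gt                  # (N, 3) aligned position errors
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```

A trajectory that differs from the ground truth only by a rotation and translation gives an ATE of numerically zero, so the metric measures drift and shape error rather than the choice of world frame.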
