MAC-VO: Metrics-aware Covariance for Learning-based Stereo Visual Odometry
Yuheng Qiu, Yutian Chen, Zihao Zhang, Wenshan Wang, Sebastian Scherer
TL;DR
MAC-VO addresses robustness gaps in learning-based stereo visual odometry by applying metrics-aware uncertainty to both 2D feature matching and 3D covariance propagation. A FlowFormer-based uncertainty network predicts per-pixel matching uncertainty, which is used to filter keypoints and to construct a metrics-aware 3D covariance ${}^c\Sigma_{i,t}^p$ with inter-axis correlations for pose graph optimization. The two-frame back end optimizes the pose by minimizing a Mahalanobis distance that uses the full covariance, with TartanVO supplying the initial estimate. Across EuRoC, KITTI, TartanAir v2, and ZED data, MAC-VO demonstrates state-of-the-art or near-state-of-the-art performance in challenging conditions, and its covariance map offers reliability guidance for autonomous-system decision making. The covariance formulation includes a depth uncertainty derived from disparity, $\sigma_d^2 \approx (bf_x\gamma)^2 / \mu_{Disp}^2$, and cross-axis terms such as $\text{Cov}(x,y) = \frac{\sigma_d^2}{f_x f_y}(u-c_x)(v-c_y)$, enabling informed weighting in back-end optimization.
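As a rough illustration of how such a covariance could be assembled, the sketch below back-projects a pixel with the pinhole model and fills a 3×3 matrix. The `cov_xy` entry is the cross-axis term quoted above. The remaining entries follow from assuming that pixel noise and depth noise are independent, which is an assumption of this sketch rather than something stated here. The function name `point_covariance` and all parameter names are hypothetical, and the depth variance is taken as an input instead of being derived from disparity.

```python
import numpy as np


def point_covariance(u, v, depth, var_depth, var_u, var_v, fx, fy, cx, cy):
    """Sketch: covariance of a back-projected 3D point in the camera frame.

    Pinhole model: x = (u - cx) * d / fx, y = (v - cy) * d / fy, z = d.
    Assumes (not from the source) that u, v and d are mutually independent.
    """
    a = (u - cx) / fx  # normalized image coordinate along x
    b = (v - cy) / fy  # normalized image coordinate along y

    # Variance of a product of two independent variables (assumed model).
    var_x = (var_u * var_depth + var_u * depth**2 + (u - cx) ** 2 * var_depth) / fx**2
    var_y = (var_v * var_depth + var_v * depth**2 + (v - cy) ** 2 * var_depth) / fy**2

    # Cross-axis terms: all three axes share the same depth error.
    cov_xy = var_depth * a * b  # sigma_d^2 / (fx * fy) * (u - cx) * (v - cy)
    cov_xz = var_depth * a
    cov_yz = var_depth * b

    return np.array(
        [
            [var_x, cov_xy, cov_xz],
            [cov_xy, var_y, cov_yz],
            [cov_xz, cov_yz, var_depth],
        ]
    )
```

Under these assumptions the off-diagonal terms vanish at the principal point and grow toward the image border, so a diagonal-only weight matrix discards more information for points far from the optical axis.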
Abstract
We propose MAC-VO, a novel learning-based stereo VO method that leverages learned metrics-aware matching uncertainty for two purposes: selecting keypoints and weighting the residuals in pose graph optimization. Compared to traditional geometric methods that prioritize texture-rich features such as edges, our keypoint selector uses the learned uncertainty to filter out low-quality features based on global inconsistency. In contrast to learning-based algorithms that model the covariance as a scale-agnostic diagonal weight matrix, we design a metrics-aware covariance model that captures the spatial error during keypoint registration and the correlations between different axes. Integrating this covariance model into pose graph optimization enhances the robustness and reliability of pose estimation, particularly in challenging environments with varying illumination, feature density, and motion patterns. On public benchmark datasets, MAC-VO outperforms existing VO algorithms and even some SLAM algorithms in challenging environments. The covariance map also provides valuable information about the reliability of the estimated poses, which can benefit decision-making for autonomous systems.
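To make the idea of weighting residuals by a full covariance concrete, the following minimal sketch evaluates a Mahalanobis cost for a candidate relative pose over matched 3D keypoints. The residual definition, the way the two per-point covariances are combined, and the name `mahalanobis_cost` are assumptions made for illustration; they are not taken from the MAC-VO implementation.

```python
import numpy as np


def mahalanobis_cost(R, t, pts_prev, pts_cur, covs_prev, covs_cur):
    """Sketch: sum of squared Mahalanobis distances for a candidate pose.

    R (3x3) and t (3,) map points from the current camera frame into the
    previous one. Each matched pair carries a full 3x3 covariance.
    Assumed residual: r_i = p_prev_i - (R @ p_cur_i + t).
    Assumed combined covariance: S_i = S_prev_i + R @ S_cur_i @ R.T.
    """
    cost = 0.0
    for p_prev, p_cur, s_prev, s_cur in zip(pts_prev, pts_cur, covs_prev, covs_cur):
        r = p_prev - (R @ p_cur + t)
        s = s_prev + R @ s_cur @ R.T
        # Solve s @ x = r instead of forming the explicit inverse.
        cost += float(r @ np.linalg.solve(s, r))
    return cost
```

A nonlinear optimizer would minimize this cost over `R` and `t`. Because each residual is scaled by its own covariance, uncertain points (for example, distant ones with large depth variance) contribute less, and the off-diagonal terms let an error along one axis be explained by its correlation with another.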
