ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation

Yanlin Jin, Rui-Yang Ju, Haojun Liu, Yuzhong Zhong

TL;DR

This work tackles the limited accuracy and generalization of monocular self-supervised visual odometry by introducing ORB-SfMLearner, which augments RGB inputs with ORB features and uses cross-attention in PoseNet to reveal how stable features guide ego-motion estimation. The method trains with self-supervised losses on depth and pose, and employs selective online adaptation at inference to rapidly tailor parameters to new scenes, improving robustness across domains. Key contributions include an effective ORB augmentation strategy, an interpretable ORB-guided attention mechanism, and a selective adaptation framework that enhances generalization, demonstrated on KITTI and vKITTI with state-of-the-art ego-motion accuracy. The approach offers practical impact for reliable monocular VO in changing conditions and deployments, with code available for replication.

Abstract

Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose ORB-SfMLearner, an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry method with selective online adaptation. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce a cross-attention mechanism to enhance the explainability of PoseNet, and show that the driving direction of the vehicle can be explained through the attention weights. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on the KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability. Code is available at https://github.com/PeaceNeil/ORB-SfMLearner

Paper Structure (15 sections, 4 equations, 5 figures, 3 tables, 1 algorithm)

This paper contains 15 sections, 4 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: The pipeline of ORB-SfMLearner. The VO DepthNet estimates depth, while the PoseNet estimates the relative pose between two frames after fusing ORB and RGB features through a multi-head cross-attention mechanism. The network is trained using the self-supervised reprojection error $\mathcal{L}_p$ [zhou2017unsupervised], geometry consistency error $\mathcal{L}_c$ [bian2019unsupervised], and depth map smoothness error $\mathcal{L}_s$ [godard2019digging]. During inference, the network selectively performs online adaptation, learning to use the weights most suitable for the current scene, thereby achieving good generalization in challenging conditions, such as foggy weather.
  • Figure 2: Our augmented ORB data structure. For each original RGB image, we extract its ORB features and arrange them as a 33-channel tensor. The first channel has the same spatial dimensions as the original image; it is set to 1 at key point positions and 0 elsewhere. The other 32 channels store the ORB descriptor of each key point at that key point's position. In use, the ORB tensors of the two frames are concatenated along the channel dimension. This representation encodes both the positions of the key points and the potential matching relationships between key points detected in the two images.
  • Figure 3: By visualizing the attention weights, a clear pattern emerges: during left or right turns, the regions with high weights also shift accordingly, often pointing towards the end of the road. The two columns on the left are selected from KITTI Odometry sequence 09, and the rightmost column is selected from sequence 07.
  • Figure 4: Qualitative results on KITTI Odometry sequences 02, 07, 08, and 09. Although the three compared methods adopt similar self-supervision and network designs, our method predicts a global trajectory that aligns more closely with the ground truth, without experiencing trajectory drift over longer predictions.
  • Figure 5: We examine how the gap between two frames affects the attention weights. Taking the $n$-th frame as the previous frame and the $k$-th frame as the subsequent one, we observe that the highlighted areas point towards the far end of the road only when the distance between $n$ and $k$ is moderate (i.e., moderate motion between frames).
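
The 33-channel layout described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `build_orb_tensor` and `stack_pair`, the rounding of key point coordinates to the nearest pixel, and the division of descriptor bytes by 255 are all assumptions made for illustration. The inputs are assumed to be key point coordinates as (x, y) pairs and an N x 32 array of descriptor bytes, which is the shape OpenCV's ORB `detectAndCompute` returns for descriptors.

```python
import numpy as np


def build_orb_tensor(height, width, keypoints_xy, descriptors):
    """Pack ORB key points and descriptors into a (33, H, W) float32 tensor.

    Channel 0 is a binary mask marking key point positions.
    Channels 1..32 hold the 32-byte descriptor at each key point position
    and are zero everywhere else.
    """
    descriptors = np.asarray(descriptors, dtype=np.float32).reshape(-1, 32)
    tensor = np.zeros((33, height, width), dtype=np.float32)
    for (x, y), desc in zip(keypoints_xy, descriptors):
        # Assumption: snap sub-pixel coordinates to the nearest pixel.
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            tensor[0, row, col] = 1.0
            # Assumption: scale descriptor bytes to [0, 1].
            tensor[1:, row, col] = desc / 255.0
    return tensor


def stack_pair(orb_prev, orb_next):
    """Concatenate the ORB tensors of two frames along the channel axis."""
    return np.concatenate([orb_prev, orb_next], axis=0)


if __name__ == "__main__":
    # Two hypothetical key points on a tiny 4 x 6 image.
    kps = [(2.0, 1.0), (4.0, 3.0)]
    desc = np.arange(64, dtype=np.uint8).reshape(2, 32)
    frame_a = build_orb_tensor(4, 6, kps, desc)
    frame_b = build_orb_tensor(4, 6, kps, desc)
    print(frame_a.shape, stack_pair(frame_a, frame_b).shape)
```

Under these assumptions, a frame pair yields a 66-channel input (2 x 33). Because each descriptor is stored at its key point's pixel, the location and the appearance code stay spatially aligned with the RGB image.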