Motion Consistency Loss for Monocular Visual Odometry with Attention-Based Deep Learning

André O. Françani; Marcos R. O. A. Maximo

TL;DR

This paper introduces a motion consistency loss for monocular visual odometry to exploit overlapping clips during training, enforcing agreement between the same motion estimated from different input windows. By combining this loss with the standard MSE objective, the authors show improved translational accuracy on the KITTI odometry benchmark, though scale drift in monocular VO remains a challenge. The approach builds on transformer-based spatio-temporal representations (divided space-time attention) to extract robust features from overlapped frame clips. Overall, the method provides a practical way to leverage temporal redundancy to enhance pose estimation, with potential extensions to integrate depth estimation for scale correction and end-to-end pose learning.

Abstract

Deep learning algorithms have driven significant progress in many complex tasks. The loss function is a core component of deep learning techniques, guiding the learning process of neural networks. This paper introduces a consistency loss for deep learning-based visual odometry. The motion consistency loss exploits repeated motions that appear in consecutive overlapping video clips. Experimental results show that our approach improves the performance of a model on the KITTI odometry benchmark.

Paper Structure (12 sections, 6 equations, 4 figures, 2 tables)

Introduction
Background
Monocular visual odometry
Time-space attention in monocular visual odometry
Proposed Method
Experimental Setup
Model setup
Training setup
Dataset
Evaluation metrics
Experimental results
Conclusion

Figures (4)

Figure 1: Camera's motion between consecutive time steps.
Figure 2: Encoder block with "divided space-time" self-attention.
Figure 3: Schematic representation of the proposed method. Input clips with overlapped frames go to a model that estimates the camera's motion between consecutive time steps. During the training step, the motion consistency loss is calculated using the estimated motions that appear in all input clips ($\hat{\mathbf{T}}_2$ in this example). The motion consistency loss is weighted by a hyperparameter $\alpha$, and this result is added to the MSE of all estimations. The final loss propagates back to the model during the training step.
Figure 4: Trajectories obtained by model A, model B, and model C, compared with the ground truth. All depicted sequences belong to the test set, except sequence 09. Trajectories are obtained under the 7-DoF alignment.
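
The loss combination described in the Figure 3 caption (the MSE of all estimations plus the motion consistency loss weighted by $\alpha$) can be illustrated with a minimal sketch. Several details here are assumptions and not taken from the paper: motions are represented as flat vectors, consecutive clips are shifted by `stride` frames (one by default), and the consistency term is computed as the mean squared difference between two estimates of the same motion. The names `mse`, `total_loss`, and `stride` are hypothetical.

```python
from typing import List, Sequence

Motion = Sequence[float]  # assumed flat motion vector (e.g. rotation + translation)


def mse(a: Motion, b: Motion) -> float:
    """Mean squared error between two motion vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(
    pred_clips: List[List[Motion]],
    gt_clips: List[List[Motion]],
    alpha: float,
    stride: int = 1,
) -> float:
    """Sketch of MSE + alpha * motion consistency (assumed formulation).

    pred_clips[c][k] is the k-th motion estimated from clip c. With clips
    shifted by `stride` frames, motion k + stride of clip c and motion k of
    clip c + 1 describe the same camera motion.
    """
    # Standard MSE over every estimated motion in every clip.
    mse_terms = [
        mse(p, g)
        for pred, gt in zip(pred_clips, gt_clips)
        for p, g in zip(pred, gt)
    ]
    l_mse = sum(mse_terms) / len(mse_terms)

    # Consistency: penalize disagreement between repeated estimates of the
    # same motion coming from consecutive overlapping clips.
    mc_terms = []
    for prev, nxt in zip(pred_clips, pred_clips[1:]):
        for k in range(len(prev) - stride):
            mc_terms.append(mse(prev[k + stride], nxt[k]))
    l_mc = sum(mc_terms) / len(mc_terms) if mc_terms else 0.0

    return l_mse + alpha * l_mc


# Two clips of three frames each: clip 0 yields (T1, T2), clip 1 yields (T2, T3).
gt = [[[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [2.0, 2.0]]]
pred = [[[0.0, 0.0], [1.0, 1.0]], [[1.0, 3.0], [2.0, 2.0]]]
loss = total_loss(pred, gt, alpha=0.5)
```

In this toy example the two clips disagree on the shared motion, so the consistency term is nonzero even though only one of the two estimates deviates from the ground truth. Setting `alpha=0` reduces the sketch to the plain MSE objective.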
