Motion Consistency Loss for Monocular Visual Odometry with Attention-Based Deep Learning
André O. Françani, Marcos R. O. A. Maximo
TL;DR
This paper introduces a motion consistency loss for monocular visual odometry to exploit overlapping clips during training, enforcing agreement between the same motion estimated from different input windows. By combining this loss with the standard MSE objective, the authors show improved translational accuracy on the KITTI odometry benchmark, though scale drift in monocular VO remains a challenge. The approach builds on transformer-based spatio-temporal representations (divided space-time attention) to extract robust features from overlapped frame clips. Overall, the method provides a practical way to leverage temporal redundancy to enhance pose estimation, with potential extensions to integrate depth estimation for scale correction and end-to-end pose learning.
Abstract
Deep learning algorithms have driven significant progress in many complex tasks. The loss function is a core component of deep learning techniques, guiding the learning process of neural networks. This paper introduces a motion consistency loss for deep learning-based visual odometry. The loss exploits repeated motions that appear in consecutive overlapping video clips. Experimental results show that our approach improves the performance of a model on the KITTI odometry benchmark.
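To make the idea of a motion consistency loss concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each clip yields a fixed number of 6-DoF relative motions, that consecutive clips are offset by a hypothetical `stride` in frames, and that the consistency term is added to the MSE term with a hypothetical weight `weight`. The exact formulation in the paper may differ.

```python
import numpy as np

def motion_consistency_loss(pred, stride=1):
    """Mean squared disagreement between motions shared by consecutive clips.

    pred: array of shape (num_clips, motions_per_clip, 6), where clip i+1
    starts `stride` frames after clip i (an assumption of this sketch).
    """
    num_clips, motions_per_clip, _ = pred.shape
    overlap = motions_per_clip - stride  # motions shared by neighboring clips
    if num_clips < 2 or overlap <= 0:
        return 0.0
    tail = pred[:-1, stride:, :]   # last `overlap` motions of clip i
    head = pred[1:, :overlap, :]   # first `overlap` motions of clip i+1
    return float(np.mean((tail - head) ** 2))

def total_loss(pred, target, weight=1.0, stride=1):
    """MSE against ground truth plus the weighted consistency term."""
    mse = float(np.mean((pred - target) ** 2))
    return mse + weight * motion_consistency_loss(pred, stride)
```

With `stride=1` and two motions per clip, the second motion of each clip and the first motion of the next clip describe the same pair of frames, so the consistency term is zero whenever the two estimates agree and grows as they diverge.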
