Visual Odometry with Transformers
Vladimir Yugay, Duy-Kien Nguyen, Theo Gevers, Cees G. M. Snoek, Martin R. Oswald
TL;DR
VoT tackles monocular visual odometry by replacing traditional optimization and hand-crafted components with an end-to-end Transformer-based pipeline. It uses a frozen pre-trained encoder to extract per-frame features and a time-space decoder to model inter-frame relationships, with a regression head that outputs relative poses; rotations are constrained by projecting to $SO(3)$ via Procrustes, and a weighted loss combines rotation and translation terms. The key contributions are: (i) end-to-end VO without bundle adjustment, (ii) strong speedups and scalable training with large data, and (iii) competitive generalization across indoor/outdoor datasets and unseen camera parameters, outperforming many large 3D models in speed and often in accuracy. This approach enables real-time, calibration-free VO suitable for AR/robotics applications and underscores the value of data-driven, geometry-aware transformers for pose estimation.
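The rotation constraint and the weighted loss mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `project_to_so3` and `pose_loss`, the L1 form of both loss terms, and the weights `w_rot` and `w_trans` are assumptions made for illustration. The projection itself is the standard SVD solution of the orthogonal Procrustes problem, which returns the rotation matrix closest to an unconstrained $3 \times 3$ output in the Frobenius norm.

```python
import numpy as np

def project_to_so3(m: np.ndarray) -> np.ndarray:
    """Return the rotation matrix closest to m in the Frobenius norm."""
    u, _, vt = np.linalg.svd(m)
    # u @ vt is orthogonal, so its determinant is +1 or -1.
    # Flipping the last singular direction when it is -1 forces det(R) = +1.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

def pose_loss(r_pred, t_pred, r_gt, t_gt, w_rot=1.0, w_trans=1.0):
    """Weighted sum of a rotation term and a translation term.

    The L1 terms and the default weights are hypothetical choices,
    not values taken from the paper.
    """
    rot_term = np.abs(r_pred - r_gt).mean()
    trans_term = np.abs(t_pred - t_gt).mean()
    return w_rot * rot_term + w_trans * trans_term

# A noisy, unconstrained 3x3 matrix standing in for a raw regression output.
raw = np.eye(3) + 0.1 * np.random.default_rng(0).standard_normal((3, 3))
rot = project_to_so3(raw)
```

After projection, `rot @ rot.T` is the identity up to floating-point error and `np.linalg.det(rot)` is 1, so the network output can be used directly as a valid relative rotation.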
Abstract
Despite the rapid development of large 3D models, classical optimization-based approaches still dominate the field of visual odometry (VO). As a result, current approaches to VO rely heavily on camera parameters and many handcrafted components, most of which involve complex bundle adjustment and feature-matching processes. Although this reliance is largely disregarded in the literature, we find it problematic in terms of both (1) speed, as performing bundle adjustment requires a significant amount of time, and (2) scalability, as handcrafted components struggle to learn from large-scale training data. In this work, we introduce a simple yet efficient architecture, Visual Odometry Transformer (VoT), that formulates monocular visual odometry as a direct relative pose regression problem. Our approach streamlines the monocular visual odometry pipeline in an end-to-end manner, effectively eliminating the need for handcrafted components such as bundle adjustment, feature matching, or camera calibration. We show that VoT is up to 4 times faster than traditional approaches while achieving competitive or better performance. Compared to recent 3D foundation models, VoT runs 10 times faster and shows strong scaling behavior with respect to both model size and training data. Moreover, VoT generalizes well in both low-data regimes and previously unseen scenarios, reducing the gap between optimization-based and end-to-end approaches.
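Because the model regresses relative poses, a camera trajectory is obtained by chaining them. The sketch below shows one way this could look, under the assumption that each relative pose is a $4 \times 4$ homogeneous transform expressing camera $k+1$ in the frame of camera $k$; the helper names `to_homogeneous` and `accumulate` are hypothetical and the paper may use a different convention.

```python
import numpy as np

def to_homogeneous(rot, trans):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    pose = np.eye(4)
    pose[:3, :3] = rot
    pose[:3, 3] = trans
    return pose

def accumulate(relative_poses):
    """Chain relative poses into absolute poses, starting from the identity.

    Assumes relative_poses[k] expresses camera k+1 in the frame of camera k,
    so the absolute pose of camera k+1 is the pose of camera k times it.
    """
    trajectory = [np.eye(4)]
    for rel in relative_poses:
        trajectory.append(trajectory[-1] @ rel)
    return trajectory

# Two identical steps: no rotation, one unit forward along the z axis.
step = to_homogeneous(np.eye(3), [0.0, 0.0, 1.0])
trajectory = accumulate([step, step])
final_position = trajectory[-1][:3, 3]
```

Since every absolute pose is a product of all earlier relative poses, errors in individual predictions compound along the trajectory, which is the drift that optimization-based pipelines typically correct with bundle adjustment.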
