Visual Odometry with Transformers

Vladimir Yugay, Duy-Kien Nguyen, Theo Gevers, Cees G. M. Snoek, Martin R. Oswald

TL;DR

VoT tackles monocular visual odometry by replacing traditional optimization and hand-crafted components with an end-to-end Transformer-based pipeline. It uses a frozen pre-trained encoder to extract per-frame features and a time-space decoder to model inter-frame relationships, with a regression head that outputs relative poses; rotations are constrained by projecting to $SO(3)$ via Procrustes, and a weighted loss combines rotation and translation terms. The key contributions are: (i) end-to-end VO without bundle adjustment, (ii) strong speedups and scalable training with large data, and (iii) competitive generalization across indoor/outdoor datasets and unseen camera parameters, outperforming many large 3D models in speed and often in accuracy. This approach enables real-time, calibration-free VO suitable for AR/robotics applications and underscores the value of data-driven, geometry-aware transformers for pose estimation.
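The rotation constraint and the weighted loss mentioned above can be illustrated with a minimal NumPy sketch. It assumes the regression head emits an unconstrained 3x3 matrix per relative pose and uses the standard SVD solution of the special orthogonal Procrustes problem. The function names, the geodesic rotation error, the L1 translation error, and the default weights are all hypothetical choices for illustration; the summary does not state the exact loss terms VoT uses.

```python
import numpy as np


def project_to_so3(raw: np.ndarray) -> np.ndarray:
    """Return the rotation matrix closest to `raw` in the Frobenius norm.

    Assumption: `raw` is an unconstrained 3x3 network output.
    """
    u, _, vt = np.linalg.svd(raw)
    # u @ vt is orthogonal but may be a reflection (det = -1).
    # Flipping the last singular direction enforces det = +1.
    det_sign = 1.0 if np.linalg.det(u @ vt) >= 0.0 else -1.0
    return u @ np.diag([1.0, 1.0, det_sign]) @ vt


def pose_loss(rot_pred, trans_pred, rot_gt, trans_gt, w_rot=1.0, w_trans=1.0):
    """Weighted sum of a rotation term and a translation term.

    Hypothetical choices: geodesic angle (radians) for rotation,
    L1 distance for translation.
    """
    cos_angle = (np.trace(rot_pred.T @ rot_gt) - 1.0) / 2.0
    rot_err = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    trans_err = np.abs(trans_pred - trans_gt).sum()
    return w_rot * rot_err + w_trans * trans_err


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rot = project_to_so3(rng.normal(size=(3, 3)))
    print("orthogonal:", np.allclose(rot @ rot.T, np.eye(3)))
    print("determinant:", round(float(np.linalg.det(rot)), 6))
```

Projecting after regression means any raw output maps to a valid rotation, so downstream pose composition never has to handle a non-orthogonal matrix.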

Abstract

Despite the rapid development of large 3D models, classical optimization-based approaches still dominate the field of visual odometry (VO). As a result, current approaches to VO rely heavily on camera parameters and many handcrafted components, most of which involve complex bundle adjustment and feature-matching processes. Although this reliance is largely disregarded in the literature, we find it problematic in terms of both (1) speed, as performing bundle adjustment requires a significant amount of time, and (2) scalability, as handcrafted components struggle to learn from large-scale training data. In this work, we introduce a simple yet efficient architecture, Visual Odometry Transformer (VoT), that formulates monocular visual odometry as a direct relative pose regression problem. Our approach streamlines the monocular visual odometry pipeline in an end-to-end manner, effectively eliminating the need for handcrafted components such as bundle adjustment, feature matching, or camera calibration. We show that VoT is up to 4 times faster than traditional approaches while achieving competitive or better performance. Compared to recent 3D foundation models, VoT runs 10 times faster and shows strong scaling behavior in terms of both model size and training data. Moreover, VoT generalizes well in both low-data regimes and previously unseen scenarios, reducing the gap between optimization-based and end-to-end approaches.

Paper Structure

This paper contains 10 sections, 11 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Visual Odometry Transformer. The metric camera trajectory (top) is derived by passing intersecting windows of images through a feed-forward model that estimates relative camera poses between them, subsequently integrating these poses into a unified trajectory. VoT does not rely on camera parameters or test-time optimization. Moreover, our method is more than 3$\times$ faster than all the baselines.
  • Figure 2: VoT architecture. Given multiple input frames, a frozen image encoder extracts per-image token embeddings. Camera embeddings are then concatenated to aggregate the information for camera pose estimation. The embeddings are decoded by $L$ repeating decoder blocks with temporal and spatial attention modules. The rotations are projected onto the SO(3) manifold to ensure valid relative rotations.
  • Figure 3: Attention maps from the VoT decoder. Each row shows an original image with a selected query (red square), followed by attention maps from the four subsequent frames. To estimate relative camera pose, VoT attends to the related image regions, resembling the behavior of classical keypoint-based odometry methods.
  • Figure 4: Predicted trajectories on the test splits of the ScanNet and ARKitScenes datasets. Most methods cannot predict a coherent trajectory without aligning to the ground truth. In contrast, VoT predicts an accurate trajectory without alignment.
  • Figure 5: Scaling behavior of VoT. As the model scales in (a) training data (proportion of ARKitScenes data added to ScanNet) and (b) model capacity (number of decoder layers), absolute translation and rotation errors decrease. This suggests that VoT exhibits robust scaling behavior.
  • ...and 2 more figures
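The trajectory integration described in the Figure 1 caption can be sketched as a chain of 4x4 homogeneous transforms. This is only an illustration under stated assumptions: each relative pose is taken to map coordinates of frame i+1 into frame i, the first frame is used as the world origin, and the helper names `make_pose` and `integrate_trajectory` are hypothetical. How VoT reconciles predictions from intersecting windows is not specified here, so the sketch takes one relative pose per consecutive frame pair.

```python
import numpy as np


def make_pose(rotation, translation) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def integrate_trajectory(relative_poses) -> np.ndarray:
    """Chain relative poses into absolute camera-to-world poses.

    Assumed convention: relative_poses[i] maps frame i+1 coordinates
    into frame i, so world_T_{i+1} = world_T_i @ relative_poses[i].
    Returns an array of shape (N + 1, 4, 4); the first pose is the identity.
    """
    poses = [np.eye(4)]
    for rel in relative_poses:
        poses.append(poses[-1] @ rel)
    return np.stack(poses)


if __name__ == "__main__":
    # Walk a unit square: move 1 unit along the local x axis, then turn
    # 90 degrees about z, four times in a row.
    yaw90 = np.array([[0.0, -1.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])
    step = make_pose(yaw90, [1.0, 0.0, 0.0])
    trajectory = integrate_trajectory([step] * 4)
    print(trajectory[:, :3, 3])  # camera positions along the path
```

Because every absolute pose is a product of all earlier relative poses, errors in individual predictions accumulate along the trajectory, which is why the accuracy of each relative estimate matters for the unaligned trajectories shown in Figure 4.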