Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

André O. Françani, Marcos R. O. A. Maximo

TL;DR

We address monocular visual odometry by reframing it as a video-understanding problem and propose TSformer-VO, a Transformer-based model that uses divided space-time self-attention to estimate the camera's 6-DoF poses from RGB clips end to end. The method converts absolute ground-truth poses to relative motions, represents rotations with Euler angles, and regresses $N_f - 1$ poses per clip of $N_f$ frames with a TimeSformer-inspired encoder and an MLP head. On the KITTI odometry dataset, TSformer-VO is competitive with both geometry-based and end-to-end deep learning baselines, notably outperforming DeepVO and approaching ORB-SLAM3 on several metrics. The results indicate that space-time attention can capture the spatio-temporal cues needed for VO without hand-engineered modules, while running in real time and remaining robust to dynamic scenes when attention focuses on static structures.
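The conversion from absolute poses to relative motions mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes KITTI-style $3 \times 4$ `[R | t]` absolute poses, a ZYX Euler convention, and an output ordering of (roll, pitch, yaw, tx, ty, tz). The helper names (`to_homogeneous`, `invert_rigid`, `relative_motions`) are hypothetical, and gimbal lock is not handled.

```python
import math

def to_homogeneous(pose_3x4):
    """Append the row [0, 0, 0, 1] to a 3x4 [R | t] pose."""
    return [list(row) for row in pose_3x4] + [[0.0, 0.0, 0.0, 1.0]]

def invert_rigid(T):
    """Invert a rigid transform: inv([R | t]) = [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rotation_to_euler_zyx(T):
    """Euler angles, assuming R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    pitch = -math.asin(max(-1.0, min(1.0, T[2][0])))
    roll = math.atan2(T[2][1], T[2][2])
    yaw = math.atan2(T[1][0], T[0][0])
    return roll, pitch, yaw

def relative_motions(abs_poses):
    """One 6-vector (roll, pitch, yaw, tx, ty, tz) per consecutive frame pair.

    A clip of N_f absolute poses yields N_f - 1 relative motions.
    """
    Ts = [to_homogeneous(p) for p in abs_poses]
    motions = []
    for T_prev, T_next in zip(Ts, Ts[1:]):
        # Pose of the next frame expressed in the previous frame's coordinates.
        T_rel = matmul(invert_rigid(T_prev), T_next)
        roll, pitch, yaw = rotation_to_euler_zyx(T_rel)
        motions.append((roll, pitch, yaw, T_rel[0][3], T_rel[1][3], T_rel[2][3]))
    return motions

# Toy clip with N_f = 3: move 1 unit along z, then move again while turning 0.1 rad about y.
c, s = math.cos(0.1), math.sin(0.1)
clip = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]],
    [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 2]],
]
motions = relative_motions(clip)
```

In practice the same steps would normally be written with NumPy and a rotation library; plain Python is used here only to keep the sketch dependency-free.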

Abstract

Estimating the camera's pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry and often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to generalize well after proper training with a large amount of available data. Transformer-based architectures have dominated the state of the art in natural language processing and in computer vision tasks such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task to estimate the 6 degrees of freedom of a camera's pose. We contribute the TSformer-VO model, which is based on spatio-temporal self-attention mechanisms, to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieved performance competitive with state-of-the-art geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the DeepVO implementation, which is widely accepted in the visual odometry community. The code is publicly available at https://github.com/aofrancani/TSformer-VO.

Paper Structure

This paper contains 24 sections, 10 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Traditional pipeline for visual odometry. The scene images are taken from the KITTI odometry dataset (Geiger et al., 2012).
  • Figure 2: TSformer-VO pipeline. The input clips with $N_f$ frames are split into $N$ patches. Each patch is embedded into a token and sent through the sequence of Transformer blocks. A special vector called the class token (cls) gathers information from all patches and passes through the final MLP head, which outputs the 6-DoF values of the $N_f - 1$ estimated poses.
  • Figure 3: Transformer encoder with the divided space-time self-attention architecture. The illustration of the encoder was inspired by Bertasius et al. (2021).
  • Figure 4: Visualization of the repeated motions, highlighted in yellow, for the particular case of $N_f = 3$ with $2$ overlapped frames.
  • Figure 5: Training and validation loss curves of TSformer-VO-1 architecture.
  • ...and 3 more figures
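
The repeated motions described in the Figure 4 caption can be illustrated with a small sketch. With $N_f = 3$ and $2$ overlapped frames, consecutive clips share frame pairs, so the same motion is estimated more than once. The code below enumerates the clips and merges repeated estimates by averaging; the averaging rule and the helper names (`clip_windows`, `merge_repeated_motions`) are assumptions made for illustration, not a description confirmed by the text above.

```python
from collections import defaultdict

def clip_windows(num_frames, n_f, overlap):
    """Frame indices of each clip for a sliding window with the given overlap."""
    stride = n_f - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the clip length")
    return [list(range(start, start + n_f))
            for start in range(0, num_frames - n_f + 1, stride)]

def merge_repeated_motions(windows, clip_predictions):
    """Average all estimates that refer to the same frame pair (i, i + 1).

    Averaging is an assumed merging rule, used here only for illustration.
    """
    buckets = defaultdict(list)
    for frames, preds in zip(windows, clip_predictions):
        # Each clip of N_f frames contributes N_f - 1 motions.
        for pair, motion in zip(zip(frames, frames[1:]), preds):
            buckets[pair].append(motion)
    return {pair: [sum(vals) / len(vals) for vals in zip(*motions)]
            for pair, motions in sorted(buckets.items())}

# Case from the caption: N_f = 3 with 2 overlapped frames, on a 5-frame sequence.
windows = clip_windows(num_frames=5, n_f=3, overlap=2)

# Dummy 6-DoF predictions (two motions per clip), standing in for model outputs.
clip_predictions = [
    [[0.0] * 6, [1.0] * 6],
    [[3.0] * 6, [2.0] * 6],
    [[4.0] * 6, [5.0] * 6],
]
merged = merge_repeated_motions(windows, clip_predictions)
```

Here the motions between frames 1 and 2 and between frames 2 and 3 are each predicted by two different clips, while the first and last motions of the sequence are predicted only once.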