
Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry

Yunus Bilge Kurt, Ahmet Akman, A. Aydın Alatan

TL;DR

This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which make better use of historical data than the recurrent neural network (RNN) based approaches used in recent work.

Abstract

In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by these successes, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry (VIO). This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which make better use of historical data than the recurrent neural network (RNN) based approaches used in recent work. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation learning methods in supervised end-to-end learning of visual-inertial odometry by utilizing specialized gradients in backpropagation for the elements of the $SE(3)$ group. The proposed method is end-to-end trainable and requires only a monocular camera and an IMU during inference. Experimental results demonstrate that VIFT increases the accuracy of monocular VIO networks, achieving state-of-the-art results when compared to previous methods on the KITTI dataset. The code will be made available at https://github.com/ybkurt/VIFT.
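To make the word "causal" concrete, the following is a minimal NumPy sketch of single-head self-attention with a causal mask, in which the feature at time step t can only attend to steps up to and including t. It is an illustration under stated assumptions, not the VIFT implementation: the function name `causal_self_attention`, the random projection matrices, and the sequence length and feature size are all hypothetical.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where step t attends only to steps <= t.

    x: (T, d) sequence of latent feature vectors (hypothetical shapes).
    Returns the attended features and the (T, T) attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    T = x.shape[0]
    # True strictly above the diagonal marks "future" positions to hide.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 5, 8  # assumed sequence length and feature size
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)

# Perturbing the last step must leave all earlier outputs unchanged.
x_perturbed = x.copy()
x_perturbed[-1] += 10.0
out_perturbed, _ = causal_self_attention(x_perturbed, w_q, w_k, w_v)
```

Because the mask zeroes every weight above the diagonal, `out[:-1]` and `out_perturbed[:-1]` agree, which is the property that lets a causal model run on a stream without access to future measurements.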

Paper Structure (17 sections, 4 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: VIFT architecture. The network consists of two main parts. The first consists of two encoders with frozen weights that map visual and inertial inputs to a latent space. The second consists of sequential transformer layers followed by a fully connected layer. To improve backpropagation through the rotation output, the output is projected to a $3\times 3$ rotation matrix representation, and RPMG (Regularized Projective Manifold Gradient) (Chen et al., 2022) is used.
  • Figure 2: Causal transformer-based architecture for fusion and pose estimation.
  • Figure 3: Proposed transformer-based fusion and pose estimation module in VIFT, evaluated under different training settings. We mark trajectories every 5 seconds to give a sense of the vehicle's speed along the trajectory and to make the results easier to distinguish. Note that the camera and IMU provide measurements at 10 FPS and 100 Hz, respectively, which is much more frequent than the marked locations. The top row shows the estimated trajectory in the test sequences from above, and the bottom row shows the vertical trajectory. All trajectories start from the origin, and relative pose estimates from VIFT are applied sequentially to obtain absolute pose estimates for each time index.
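
The Figure 3 caption notes that relative pose estimates are applied sequentially, starting from the origin, to obtain absolute poses. A minimal sketch of that accumulation is given below. It assumes each relative pose is a 4×4 homogeneous transform expressed in the previous frame and composed by right-multiplication; this convention and the helper names `se3`, `rot_z`, and `accumulate` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def se3(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def accumulate(relative_poses):
    """Chain relative transforms into absolute poses, starting at the origin.

    Assumes each relative pose is expressed in the previous frame,
    so absolute[i + 1] = absolute[i] @ relative[i].
    """
    absolute = [np.eye(4)]
    for rel in relative_poses:
        absolute.append(absolute[-1] @ rel)
    return np.stack(absolute)

# Toy input: move 1 m forward, then turn 90 degrees left, four times.
step = se3(rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0]))
trajectory = accumulate([step] * 4)
positions = trajectory[:, :3, 3]  # (5, 3) array of absolute positions
```

In this toy case the path traces a unit square and the final pose returns to the origin. With real estimates, any error in an early relative pose propagates to every later absolute pose, which is why drift is visible in accumulated trajectories.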