XIRVIO: Critic-guided Iterative Refinement for Visual-Inertial Odometry with Explainable Adaptive Weighting

Chit Yuen Lam, Ronald Clark, Basaran Bahadir Kocer

TL;DR

XIRVIO addresses robust monocular visual-inertial odometry by integrating a transformer-based generator with a critic-guided iterative refinement loop and a self-emergent adaptive sensor weighting policy. It leverages a RAFT-based visual encoder and dedicated inertial encoders to produce modality-specific features, which are weighted by a Policy Encoder and refined through successive pose deltas under a WGAN-GP framework; the critic selects the best refinement iteration while providing explainable feedback. On KITTI, XIRVIO achieves competitive performance relative to state-of-the-art learning-based VIO methods and demonstrates meaningful, human-interpretable sensor weighting that adapts to context. The work advances VIO by offering both high accuracy and explainability for safety-critical robotic applications, with potential to incorporate additional sensor modalities in the future.

Abstract

We introduce XIRVIO, a transformer-based Generative Adversarial Network (GAN) framework for monocular visual-inertial odometry (VIO). Taking sequences of images and 6-DoF inertial measurements as inputs, XIRVIO's generator predicts pose trajectories through an iterative refinement process; the critic then evaluates every iteration and selects the one with the best prediction. Additionally, the self-emergent adaptive sensor weighting reveals how XIRVIO attends to each sensory input based on contextual cues in the data, making it a promising approach for achieving explainability in safety-critical VIO applications. Evaluations on the KITTI dataset demonstrate that XIRVIO matches well-known state-of-the-art learning-based methods in terms of both translation and rotation errors.
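
The refine-then-select loop described in the abstract can be illustrated with a minimal sketch. Everything here is a toy stand-in rather than the paper's implementation: `refine_and_select`, `make_toy_delta`, `toy_critic`, the 6-element pose layout, and the convention that a higher critic score is better are all assumptions made for illustration. In XIRVIO the pose deltas and critic scores come from trained networks.

```python
from typing import Callable, List, Tuple

Pose = Tuple[float, ...]  # assumed layout: (tx, ty, tz, rx, ry, rz)


def refine_and_select(
    initial_pose: Pose,
    propose_delta: Callable[[Pose], Pose],
    critic_score: Callable[[Pose], float],
    num_iterations: int,
) -> Tuple[int, Pose, List[float]]:
    """Refine a pose by successive deltas and keep the best-scored iterate."""
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    pose = initial_pose
    candidates: List[Pose] = []
    scores: List[float] = []
    for _ in range(num_iterations):
        delta = propose_delta(pose)                        # generator step (stand-in)
        pose = tuple(p + d for p, d in zip(pose, delta))   # apply the pose delta
        candidates.append(pose)
        scores.append(critic_score(pose))                  # critic step (stand-in)
    # Assumption: a higher critic score means a more plausible pose.
    best = max(range(num_iterations), key=scores.__getitem__)
    return best, candidates[best], scores


# Hypothetical target pose, used only by the toy generator and critic below.
TARGET: Pose = (1.0, 0.0, 0.0, 0.0, 0.0, 0.0)


def make_toy_delta(step: float) -> Callable[[Pose], Pose]:
    """Toy generator: move a fixed fraction of the way towards TARGET."""
    def propose(pose: Pose) -> Pose:
        return tuple(step * (t - p) for p, t in zip(pose, TARGET))
    return propose


def toy_critic(pose: Pose) -> float:
    """Toy critic: negative squared distance to TARGET."""
    return -sum((p - t) ** 2 for p, t in zip(pose, TARGET))


if __name__ == "__main__":
    start: Pose = (0.0,) * 6
    best, pose, scores = refine_and_select(start, make_toy_delta(0.5), toy_critic, 4)
    print(best, pose, scores)
```

With a well-behaved step (`0.5`) the last iterate scores best. With an overshooting step such as `2.5` the iterates diverge and the first one is selected, which shows why scoring every iteration can matter more than simply taking the final one.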

Paper Structure

This paper contains 24 sections, 6 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of our self-emergent adaptive sensor weighting for a sample trajectory from the KITTI dataset. The stacked bars represent how the policy encoder dynamically allocates weights to each sensor modality. The labels "Flow", "Rot", and "Tx" on the bar charts denote optical flow, IMU rotation, and IMU translation respectively. The weights are normalised (min–max) within each modality for clarity.
  • Figure 2: An overview of the XIRVIO architecture. The generator encodes the image and IMU inputs, produces an adaptive sensor weighting, and generates pose predictions iteratively. These poses are passed on to the critic together with the encoded vectors to obtain a critic score.
  • Figure 3: A simplified overview of critic-guided iterative refinement. The Generative-Iterative Pose Transformer $G_T$ generates and iteratively refines the pose estimates. Every pose iteration is evaluated by the critic $C$ to obtain a critic score, and the iteration with the best critic score is selected as the final pose estimate.
  • Figure 4: Variation of pose loss and negative critic score per iteration.
  • Figure 5: Variation of model performance against the number of iterations.
  • ...and 1 more figure
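
The per-modality min-max normalisation mentioned in the Figure 1 caption can be sketched as below. This is an illustrative sketch, not the authors' plotting code: the function name, the example weight values, and the choice to map a constant series to zeros are assumptions. Only the modality labels "Flow", "Rot", and "Tx" come from the caption.

```python
from typing import Dict, List


def min_max_normalise(values: List[float]) -> List[float]:
    """Rescale one modality's weights over a trajectory to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant series carries no variation, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw policy weights per time step for each modality.
raw_weights: Dict[str, List[float]] = {
    "Flow": [0.25, 0.5, 0.75],
    "Rot": [2.0, 1.0, 3.0],
    "Tx": [0.5, 0.5, 0.5],
}

# Normalise within each modality independently, as the caption describes.
normalised = {name: min_max_normalise(series) for name, series in raw_weights.items()}
```

Because each modality is rescaled on its own, the normalised values show when a modality is weighted relatively more or less along the trajectory, but they are not comparable in absolute terms across modalities.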