XIRVIO: Critic-guided Iterative Refinement for Visual-Inertial Odometry with Explainable Adaptive Weighting
Chit Yuen Lam, Ronald Clark, Basaran Bahadir Kocer
TL;DR
XIRVIO addresses robust monocular visual-inertial odometry by integrating a transformer-based generator with a critic-guided iterative refinement loop and a self-emergent adaptive sensor weighting policy. It uses a RAFT-based visual encoder and dedicated inertial encoders to produce modality-specific features. A Policy Encoder weights these features, and the prediction is refined through successive pose deltas under a WGAN-GP framework; the critic selects the best refinement iteration while providing explainable feedback. On KITTI, XIRVIO achieves competitive performance relative to state-of-the-art learning-based VIO methods and shows meaningful, human-interpretable sensor weighting that adapts to context. The work advances VIO by offering both high accuracy and explainability for safety-critical robotic applications, with potential to incorporate additional sensor modalities in the future.
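The weighting, refinement, and selection steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`fuse_features`, `refine_with_critic`), the softmax weighting, the additive update on a 6-vector pose, and the toy `delta_fn` and `critic_fn` stand-ins are all assumptions made for illustration. In XIRVIO these roles are played by learned networks.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_features(visual_feat, inertial_feat, policy_logits):
    """Scale each modality's features by a softmax weight (assumed weighting scheme)."""
    weights = softmax(np.asarray(policy_logits, dtype=float))
    fused = np.concatenate([weights[0] * visual_feat, weights[1] * inertial_feat])
    return fused, weights

def refine_with_critic(initial_pose, fused, delta_fn, critic_fn, num_iters=4):
    """Apply successive pose deltas and keep the iterate the critic scores highest."""
    pose = np.asarray(initial_pose, dtype=float)
    candidates, scores = [], []
    for _ in range(num_iters):
        pose = pose + delta_fn(pose, fused)  # simplified additive 6-DoF update
        candidates.append(pose.copy())
        scores.append(float(critic_fn(pose, fused)))
    best = int(np.argmax(scores))
    return candidates[best], best, scores

# Toy stand-ins (hypothetical): the "generator" steps halfway toward a fixed
# target, and the "critic" scores a pose by its negative distance to that target.
target = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.1])
toy_delta = lambda pose, feat: 0.5 * (target - pose)
toy_critic = lambda pose, feat: -np.linalg.norm(target - pose)

fused, weights = fuse_features(np.ones(4), np.ones(3), policy_logits=[2.0, 0.0])
pose, best, scores = refine_with_critic(np.zeros(6), fused, toy_delta, toy_critic)
```

Here the weights are the quantity one would inspect for explainability: they sum to one, and a larger logit for a modality gives it a larger share of the fused representation. Selecting the iterate by `argmax` over critic scores assumes the critic assigns higher scores to more plausible poses, as a WGAN critic conventionally does for real samples.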
Abstract
We introduce XIRVIO, a transformer-based Generative Adversarial Network (GAN) framework for monocular visual-inertial odometry (VIO). Taking sequences of images and 6-DoF inertial measurements as inputs, XIRVIO's generator predicts pose trajectories through an iterative refinement process; the critic then evaluates each refined prediction and selects the iteration with the best one. Additionally, the self-emergent adaptive sensor weighting reveals how XIRVIO attends to each sensory input based on contextual cues in the data, making it a promising approach for achieving explainability in safety-critical VIO applications. Evaluations on the KITTI dataset show that XIRVIO matches well-known state-of-the-art learning-based methods in both translation and rotation errors.
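For readers unfamiliar with how translation and rotation errors are obtained from poses, the sketch below computes both for a single pair of estimated and ground-truth 4x4 homogeneous transforms. It is a simplified illustration under stated assumptions, not the evaluation protocol used in the paper: the KITTI odometry benchmark averages such errors over subsequences of several lengths and normalises them by distance travelled. The helper names `relative_pose_error` and `yaw_pose` are hypothetical.

```python
import numpy as np

def relative_pose_error(T_est, T_gt):
    """Return (translation error, rotation error in radians) between two SE(3) poses."""
    T_err = np.linalg.inv(T_gt) @ T_est              # residual transform
    t_err = float(np.linalg.norm(T_err[:3, 3]))      # length of residual translation
    # Rotation angle of the residual rotation matrix, from its trace.
    cos_angle = (np.trace(T_err[:3, :3]) - 1.0) / 2.0
    r_err = float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return t_err, r_err

def yaw_pose(yaw, translation):
    """Build a 4x4 pose from a rotation about the z-axis and a translation (toy helper)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

# Ground truth: 1 m forward, no rotation. Estimate: 0.2 m lateral drift, 0.1 rad yaw drift.
T_gt = yaw_pose(0.0, [1.0, 0.0, 0.0])
T_est = yaw_pose(0.1, [1.0, 0.2, 0.0])
t_err, r_err = relative_pose_error(T_est, T_gt)
```

In this toy case the residual translation has length 0.2 m and the residual rotation angle is 0.1 rad, which is what the function returns up to floating-point precision.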
