
UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation

Jinho Park, Se Young Chun, Mingoo Seok

TL;DR

UL-VIO is an ultra-lightweight (<1M) VIO network capable of test-time adaptation (TTA) based on visual-inertial consistency. It achieves a 36X smaller network size than the state of the art with only a minute increase in error (1% on the KITTI dataset).

Abstract

Data-driven visual-inertial odometry (VIO) has attracted attention for its performance, since VIO is a crucial component of autonomous robots. However, deploying these networks on resource-constrained devices is non-trivial because their large number of parameters must fit in device memory. Furthermore, they risk failure after deployment due to environmental distribution shifts at test time. In light of this, we propose UL-VIO, an ultra-lightweight (<1M) VIO network capable of test-time adaptation (TTA) based on visual-inertial consistency. Specifically, we compress the network while preserving the low-level encoder part, including all BatchNorm parameters, for resource-efficient test-time adaptation. It achieves a 36X smaller network size than the state of the art with only a minute increase in error (1% on the KITTI dataset). For test-time adaptation, we propose using the inertia-referred network outputs as pseudo labels and updating the BatchNorm parameters for lightweight yet effective adaptation. To the best of our knowledge, this is the first work to perform noise-robust TTA on VIO. Experimental results on the KITTI, EuRoC, and Marulan datasets demonstrate the effectiveness of our resource-efficient adaptation method under diverse TTA scenarios with dynamic domain shifts.
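The adaptation idea in the abstract (keep the network weights frozen, update only BatchNorm parameters, and use the inertial-only output as a pseudo label) can be illustrated with a toy scalar model. This is a minimal sketch under stated assumptions, not the authors' implementation: the single frozen weight `w`, the squared-error consistency loss, the plain gradient-descent update, and all function names are hypothetical choices made for illustration.

```python
from statistics import fmean


def normalize(xs, eps=1e-5):
    """Batch-normalize scalar features using the current batch statistics."""
    mu = fmean(xs)
    var = fmean([(x - mu) ** 2 for x in xs])
    return [(x - mu) / (var + eps) ** 0.5 for x in xs]


def consistency_loss(features, pseudo_labels, w, gamma, beta):
    """Mean squared gap between the model output and the pseudo labels."""
    x_hat = normalize(features)
    preds = [w * (gamma * xh + beta) for xh in x_hat]
    return fmean([(p - y) ** 2 for p, y in zip(preds, pseudo_labels)])


def adapt_bn_affine(features, pseudo_labels, w, gamma, beta, lr=0.1, steps=200):
    """Update only the BN affine parameters (gamma, beta); w stays frozen.

    Toy model (an assumption): pred = w * (gamma * x_hat + beta).
    The learning rate is tuned for this toy problem, not for a real network.
    """
    x_hat = normalize(features)
    n = len(x_hat)
    for _ in range(steps):
        errs = [w * (gamma * xh + beta) - y for xh, y in zip(x_hat, pseudo_labels)]
        # Analytic gradients of the mean squared consistency loss.
        grad_gamma = 2.0 / n * sum(e * w * xh for e, xh in zip(errs, x_hat))
        grad_beta = 2.0 / n * sum(e * w for e in errs)
        gamma -= lr * grad_gamma
        beta -= lr * grad_beta
    return gamma, beta


# Hypothetical data: features from a shifted domain and inertial pseudo labels.
features = [0.0, 1.0, 2.0, 3.0]
pseudo_labels = [1.0, 2.0, 3.0, 4.0]
gamma, beta = adapt_bn_affine(features, pseudo_labels, w=2.0, gamma=1.0, beta=0.0)
```

Because only `gamma` and `beta` change, the per-step update touches two numbers in this toy case; in a real network the same restriction would limit updates to the BatchNorm layers, which is what makes this style of adaptation lightweight.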

Paper Structure

This paper contains 12 sections, 6 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: We address a domain shift problem that is likely to occur in driving scenarios. To emulate real-world driving, we introduce various vision noises into the image sequence fed to the VIO model. We continuously run multiple odometry sequences to assess test-time adaptation without forgetting.
  • Figure 2: Overall framework setup for UL-VIO. The network has two input streams: visual and inertial. Modulated by the noise signal, the environment simulator emulates adverse weather conditions. When the adaptation gating signal is turned on, the network adapts using the inertial input as the pseudo label. Parallel multi-modal encoders independently generate the visual and inertial features. Two pose outputs are generated, one from fused visual-inertial features and one from inertial features only.
  • Figure 3: Model compression. We shrink the module size but keep the low-level parts in the visual encoder, including all BN parameters, to ensure test-time adaptation. We achieve $\{117\times, 8\times, 161\times\}$ reduction in $\{E_\text{visual}, E_\text{inertial}, D_\text{inertial}\}$.
  • Figure 4: Lightweight visual encoder with dictionary-based adaptation. The statistics of intermediate feature maps during and after the first layer are taken to generate ddfs. Although we aggressively reduce the visual parameter footprint, we keep the BN parameters intact for adaptation.
  • Figure 5: Motivation for consistency loss. (a) In a clean setting, visual feature-based inference far surpasses inertial-only inference. The tick represents the standard deviation. (b) Pose outputs from fused features are strongly affected in noisy environments. (c) A strong correlation ($r=0.86$) is shown between the relative translation error of the predicted pose against the ground truth ($x$-axis) and against the inertial-inferred pseudo label ($y$-axis).
  • ...and 6 more figures
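
The correlation reported in Figure 5(c) can be computed from paired error samples. The sketch below assumes $r$ denotes the Pearson correlation coefficient (the caption does not say which coefficient is used), and the function name and sample values are hypothetical, not data from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and per-sequence deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Hypothetical per-frame errors: against ground truth (x-axis) and
# against the inertial-inferred pseudo label (y-axis).
err_vs_ground_truth = [0.10, 0.25, 0.40, 0.55, 0.90]
err_vs_pseudo_label = [0.12, 0.20, 0.45, 0.50, 0.85]
r = pearson_r(err_vs_ground_truth, err_vs_pseudo_label)
```

A high value of `r` would mean the pseudo-label discrepancy tracks the true error, which is the property that motivates using it as a training signal when ground truth is unavailable at test time.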