ICD-Net: Inertial Covariance Displacement Network for Drone Visual-Inertial SLAM

Tali Orlev Shapira, Itzik Klein

TL;DR

This work introduces ICD-Net, a two-head neural network that learns inertial displacement and its uncertainty directly from raw IMU data to augment visual-inertial SLAM. The predicted displacements enter the VINS-Fusion optimization as additional residuals weighted by the predicted per-axis covariances, which helps compensate for the calibration errors, sensor noise, and high dynamics common in drone flight. On challenging high-speed drone sequences, the method improves mean absolute pose error (APE) by more than 38% relative to standard VINS-Fusion and remains robust during camera blackouts, with the uncertainty estimates weighting the neural constraints in the optimization. ICD-Net also functions as a standalone inertial odometry system and holds promise for broader adoption in Kalman-filter-based pipelines and other probabilistic estimators.

Abstract

Visual-inertial SLAM systems often exhibit suboptimal performance due to multiple confounding factors including imperfect sensor calibration, noisy measurements, rapid motion dynamics, low illumination, and the inherent limitations of traditional inertial navigation integration methods. These issues are particularly problematic in drone applications where robust and accurate state estimation is critical for safe autonomous operation. In this work, we present ICD-Net, a novel framework that enhances visual-inertial SLAM performance by learning to process raw inertial measurements and generating displacement estimates with associated uncertainty quantification. Rather than relying on analytical inertial sensor models that struggle with real-world sensor imperfections, our method directly extracts displacement maps from sensor data while simultaneously predicting measurement covariances that reflect estimation confidence. We integrate ICD-Net outputs as additional residual constraints into the VINS-Fusion optimization framework, where the predicted uncertainties appropriately weight the neural network contributions relative to traditional visual and inertial terms. The learned displacement constraints provide complementary information that compensates for various error sources in the SLAM pipeline. Our approach can be used under both normal operating conditions and in situations of camera inconsistency or visual degradation. Experimental evaluation on challenging high-speed drone sequences demonstrated that our approach significantly improved trajectory estimation accuracy compared to standard VINS-Fusion, with more than 38% improvement in mean APE and uncertainty estimates proving crucial for maintaining system robustness. Our method shows that neural network enhancement can effectively address multiple sources of SLAM degradation while maintaining real-time performance requirements.
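
To illustrate how a predicted uncertainty can weight a learned displacement constraint, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function names `weighted_displacement_residual` and `cost`, the diagonal (per-axis) variance, and the use of position differences between two states are all assumptions made for illustration. The residual is the gap between the displacement implied by two position states and the network's predicted displacement, divided by the predicted standard deviation on each axis.

```python
import math

def weighted_displacement_residual(p_i, p_j, d_pred, var_pred):
    """Whitened residual for a learned displacement constraint (illustrative).

    p_i, p_j : position states at the start and end of the window
    d_pred   : displacement predicted by the network (hypothetical input)
    var_pred : predicted per-axis variance (assumed diagonal covariance)
    """
    residual = []
    for a, b, d, v in zip(p_i, p_j, d_pred, var_pred):
        if v <= 0.0:
            raise ValueError("predicted variance must be positive")
        # Divide by the standard deviation so confident predictions weigh more.
        residual.append(((b - a) - d) / math.sqrt(v))
    return residual

def cost(residual):
    """Least-squares cost contributed by one whitened residual."""
    return 0.5 * sum(r * r for r in residual)

# Same prediction error, two different confidence levels.
p_i = (0.0, 0.0, 1.0)
p_j = (1.0, 0.5, 1.0)
d_pred = (0.8, 0.5, 0.1)

confident = weighted_displacement_residual(p_i, p_j, d_pred, (0.01, 0.01, 0.01))
uncertain = weighted_displacement_residual(p_i, p_j, d_pred, (1.0, 1.0, 1.0))

print(cost(confident), cost(uncertain))
```

With the same prediction error, the low-variance case contributes a much larger cost than the high-variance case, so an optimizer would follow the network closely only when the network reports high confidence.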

Paper Structure

This paper contains 19 sections, 28 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Enhanced VINS-Fusion framework with our proposed modifications. We introduce ICD-Net for learning displacement and uncertainty estimates from inertial data, and augment the VINS-Fusion optimization with additional neural network-based loss terms (highlighted in red): $\mathcal{L}_{\text{NN-speed}}$ and $\mathcal{L}_{\text{smoothness}}$.
  • Figure 2: The ICD-Net architecture uses two heads to process the inertial readings and output the displacement and covariance estimates.
  • Figure 3: Comparison of ground truth (GT) and predicted (Pred) trajectories for the X, Y, and Z coordinates across three scenarios (A - indoor_forward_6, B - indoor_forward_9, and C - indoor_forward_3). Solid blue lines represent ground truth values, dashed red lines show predictions, and shaded regions indicate $\pm 1\sigma$ uncertainty bounds. The predictions follow the general direction of the ground truth trajectories across scenarios and movement patterns, with accuracy varying across the three spatial dimensions.
  • Figure 4: indoor_forward_6 trajectory comparison. The baseline is standard VINS-Fusion, while the ICD-Net framework is VINS-Fusion after integrating our network's predictions. Dashed lines represent ground truth trajectories, while colored lines show VINS-Fusion estimates. The results demonstrate significant performance improvement with reduced errors.
  • Figure 5: indoor_forward_3 trajectory with camera inconsistency for the (a) baseline and (b) our ICD-Net framework. The baseline VINS-Fusion produces trajectories that bear no resemblance to the ground truth circular path. The ICD-Net integrated system successfully retains circular motion patterns and remains within the ground truth operational area.