DVI-SLAM: A Dual Visual Inertial SLAM Network

Xiongfeng Peng, Zhihua Liu, Weiming Li, Ping Tan, SoonYong Cho, Qiang Wang

TL;DR

DVI-SLAM addresses the challenge of effectively fusing multiple visual cues with IMU data in SLAM by introducing a differentiable, end-to-end framework that dynamically weighs re-projection, feature-metric, and inertial residuals through learned confidence maps. It extends the DROID-SLAM paradigm with a dual-visual-factor design and a multi-factor DBA layer, enabling tightly-coupled optimization over pose, depth, and IMU motion. The approach achieves state-of-the-art results on TartanAir, EuRoC, and ETH3D-SLAM, with large reductions in absolute trajectory error when all three factors are fused. This work demonstrates the practical value of dynamic, reliability-weighted factor fusion for robust visual-inertial navigation and mapping, with potential extensions to deeper integration of IMU factors and richer 3D scene representations.

Abstract

Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information and better integrate inertial measurement unit (IMU) data in visual SLAM still merits further research. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both a photometric factor and a re-projection factor into an end-to-end differentiable structure through a multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors, and that it can be further extended to include IMU factors as well. Extensive experiments validate that the proposed method significantly outperforms state-of-the-art methods on several public datasets, including TartanAir, EuRoC, and ETH3D-SLAM. Specifically, when the three factors are dynamically fused, the absolute trajectory error on the EuRoC dataset is reduced by 45.3% and 36.2% for the monocular and stereo configurations, respectively.
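The headline numbers in the abstract are relative reductions in absolute trajectory error (ATE). As a minimal sketch of how such a figure can be computed, the snippet below evaluates an ATE RMSE and the percentage reduction between two methods. It assumes the estimated and ground-truth positions are already time-associated and aligned to a common frame (the usual alignment step is omitted); the function names `ate_rmse` and `relative_reduction` are illustrative and do not come from the paper.

```python
import numpy as np


def ate_rmse(est_xyz, gt_xyz):
    """RMSE of per-frame position error.

    Assumption: both trajectories are (N, 3) position arrays that are
    already time-associated and aligned to the same frame.
    """
    est = np.asarray(est_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    err = np.linalg.norm(est - gt, axis=1)  # Euclidean error per frame
    return float(np.sqrt(np.mean(err ** 2)))


def relative_reduction(baseline_ate, improved_ate):
    """Percentage by which improved_ate is lower than baseline_ate."""
    return 100.0 * (baseline_ate - improved_ate) / baseline_ate
```

Under this definition, a reported reduction of 45.3% would mean the fused system's ATE is 54.7% of the baseline's ATE.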

Paper Structure

This paper contains 26 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of the DVI-SLAM structure (only two-view reconstruction is shown, for clarity). (a) Visual correlation features and appearance features are extracted from the two input frames $\boldsymbol{I}_{i}, \boldsymbol{I}_{j}$. (b) The multi-factor data association module learns the feature correspondences and confidence maps, and the IMU pre-integration results are computed from the accelerometer and gyroscope measurements $\{\boldsymbol{\alpha}_{k}, \boldsymbol{\omega}_{k}\}$ recorded between the two frames. (c) The confidence maps dynamically fuse the re-projection, feature-metric, and IMU residuals, which are optimized via a differentiable Levenberg-Marquardt (LM) step over the pose $\boldsymbol{T}$, IMU motion $\boldsymbol{M}$, and inverse depth $\boldsymbol{d}$. $\boldsymbol{T}$ is also computed via IMU propagation. Finally, $\boldsymbol{T}$, $\boldsymbol{M}$, and $\boldsymbol{d}$ are updated for the next iteration.
  • Figure 2: Evolution of the confidence maps over iterations. High values in the re-projection confidence map first concentrate on corner pixels and gradually extend to edge pixels as the iterations proceed. The feature-metric confidence is zero at the start of the iterations and concentrates on well-textured regions after several iterations.
  • Figure 3: Change in the average re-projection confidence and the average feature-metric confidence as the number of iterations increases.
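
The fusion step described in Figure 1(c) can be illustrated with a toy example. The sketch below is not the paper's DBA layer: it is a generic damped Gauss-Newton (LM-style) update in which each residual block is scaled by a per-residual confidence weight before the blocks are summed into one normal equation. The function name `weighted_lm_step`, the linear toy problem, and all numeric values are assumptions made for illustration.

```python
import numpy as np


def weighted_lm_step(residuals, jacobians, confidences, damping=1e-3):
    """One damped Gauss-Newton update fusing several residual blocks.

    residuals:   list of (m_k,) arrays, one per factor
    jacobians:   list of (m_k, n) arrays, one per factor
    confidences: list of (m_k,) non-negative weights, one per factor
    Solves (sum_k J_k^T W_k J_k + damping * I) dx = -sum_k J_k^T W_k r_k.
    """
    n = jacobians[0].shape[1]
    H = np.zeros((n, n))
    g = np.zeros(n)
    for r, J, w in zip(residuals, jacobians, confidences):
        H += J.T @ (w[:, None] * J)  # confidence-weighted normal matrix
        g += J.T @ (w * r)           # confidence-weighted gradient
    H += damping * np.eye(n)
    return -np.linalg.solve(H, g)


# Toy linear problem (hypothetical values): two factors observe the same state.
x_true = np.array([1.0, -2.0])
A_rep = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
A_feat = np.array([[2.0, 1.0], [1.0, 3.0]])
b_rep = A_rep @ x_true                            # clean measurements
b_feat = A_feat @ x_true + np.array([5.0, -5.0])  # corrupted measurements

x0 = np.zeros(2)
residuals = [A_rep @ x0 - b_rep, A_feat @ x0 - b_feat]
jacobians = [A_rep, A_feat]

# Equal trust in both factors versus zero confidence on the corrupted one.
conf_equal = [np.ones(4), np.ones(2)]
conf_down = [np.ones(4), np.zeros(2)]

x_equal = x0 + weighted_lm_step(residuals, jacobians, conf_equal, damping=1e-9)
x_down = x0 + weighted_lm_step(residuals, jacobians, conf_down, damping=1e-9)
```

With equal confidences the corrupted block biases the estimate, while assigning it zero confidence lets the update recover `x_true` up to the small damping term. This mirrors, in a much simplified form, the behavior in Figure 2, where the feature-metric confidence starts at zero and the re-projection factor dominates early iterations.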