
Visual-Inertial SLAM as Simple as A, B, VINS

Nathaniel Merrill, Guoquan Huang

TL;DR

AB-VINS might not be as accurate as state-of-the-art VINS algorithms, but it is shown to be more robust.

Abstract

We present AB-VINS, a different kind of visual-inertial SLAM system. Unlike most popular VINS methods, which rely only on hand-crafted techniques, AB-VINS makes use of three different deep neural networks. Instead of estimating sparse feature positions, AB-VINS estimates only the scale and bias parameters (a and b) of monocular depth maps, along with other terms that correct the depth using multi-view information, which results in a compressed feature state. Despite being an optimization-based system, the front-end motion tracking thread of AB-VINS surpasses the efficiency of a state-of-the-art filtering-based method while also providing dense depth. When performing loop closures, standard keyframe-based SLAM systems need to relinearize a number of variables that grows linearly with the number of keyframes. In contrast, the proposed AB-VINS can incorporate loop closures while affecting only a constant number of variables. This is thanks to a novel data structure called the memory tree, in which keyframe poses are defined relative to each other rather than all in one global frame, allowing all but a few states to be fixed. While AB-VINS might not be as accurate as state-of-the-art VINS algorithms, it is shown to be more robust.
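To make the "a and b" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the correction is affine, `metric ≈ a * mono + b`, applied directly to the network output; whether the paper applies it to depth or inverse depth is not stated in the abstract. It also fits a and b by ordinary least squares from a few known depths purely for illustration, whereas AB-VINS estimates them inside its optimization together with the multi-view correction terms. The function names `fit_scale_bias` and `apply_scale_bias` are hypothetical.

```python
import numpy as np


def fit_scale_bias(mono, metric):
    """Least-squares fit of metric ≈ a * mono + b (illustrative assumption)."""
    # Design matrix with one column for the scale and one for the bias.
    A = np.stack([mono, np.ones_like(mono)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return float(a), float(b)


def apply_scale_bias(depth_map, a, b):
    """Correct every pixel of a dense map with just two parameters."""
    return a * depth_map + b


# Synthetic data: a "monocular" map that differs from the metric map
# only by an unknown scale and bias.
rng = np.random.default_rng(0)
mono_map = rng.uniform(0.1, 1.0, size=(4, 6))
true_a, true_b = 2.5, 0.3
metric_map = true_a * mono_map + true_b

# Pretend metric depth is known at only 8 sparse pixels.
idx = rng.choice(mono_map.size, size=8, replace=False)
a, b = fit_scale_bias(mono_map.ravel()[idx], metric_map.ravel()[idx])

# Two parameters are enough to correct the whole dense map.
dense = apply_scale_bias(mono_map, a, b)
```

The point of the sketch is the size of the state: the dense map has 24 pixels here, but only two numbers are estimated, which is the sense in which the feature state is compressed.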

Paper Structure

This paper contains 31 sections, 17 equations, 26 figures, 9 tables.

Figures (26)

  • Figure 1: Visualizing the proposed AB-VINS: On the bottom left is the current frame with feature tracks (red) and reprojected points (blue) overlaid. On the bottom center is the most recent keyframe depth map. On the bottom right is the code mask, which divides the images into different regions that are pushed and pulled to correct the depth according to multi-view information. In the main window at the top, the memory tree can be seen, which is a novel data structure used to speed up pose graph optimization.
  • Figure 2: System diagram of the proposed AB-VINS.
  • Figure 3: A graphical representation of tracking to static map. The 6-DoF pose is estimated from the static map using only visual measurements, and an inertial-only optimization estimates the velocity and biases.
  • Figure 4: Left: The result of tightly coupling the visual and inertial measurements in the tracking optimization. Right: The result of the proposed decoupled approach. Tracked image coordinates are in red and reprojected points are in blue.
  • Figure 5: A graphical representation of the local mapping optimization with three keyframes. For concise presentation, calibration states are not shown.
  • ...and 21 more figures
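
The abstract and Figure 1 describe the memory tree as storing keyframe poses relative to each other rather than in one global frame. The sketch below illustrates that general idea with a plain parent-pointer tree of relative transforms. It is an assumption-laden toy, not the paper's data structure: the `Node` class, the `make_pose` helper, and the chain-shaped tree are all hypothetical, and the actual layout and balancing of the memory tree are not described in this section.

```python
import numpy as np


def make_pose(yaw, t):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T


class Node:
    """A keyframe whose pose is stored relative to its parent (hypothetical)."""

    def __init__(self, T_parent_node, parent=None):
        self.T_parent_node = T_parent_node  # the only stored variable
        self.parent = parent

    def global_pose(self):
        # Compose relative transforms while walking up to the root.
        T = self.T_parent_node
        node = self.parent
        while node is not None:
            T = node.T_parent_node @ T
            node = node.parent
        return T


root = Node(np.eye(4))
kf1 = Node(make_pose(0.0, [1.0, 0.0, 0.0]), root)
kf2 = Node(make_pose(0.0, [1.0, 0.0, 0.0]), kf1)

before = kf2.global_pose()[:3, 3].copy()

# A correction touches ONE stored variable (kf1's relative pose) ...
kf1.T_parent_node = make_pose(0.0, [1.0, 0.5, 0.0])

# ... yet the global pose of every descendant moves with it.
after = kf2.global_pose()[:3, 3]
```

Because descendants are expressed relative to their parents, editing a single relative transform shifts a whole subtree in the global frame without touching the subtree's own variables. This is the intuition behind a loop closure affecting only a constant number of variables.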