OKVIS2-X: Open Keyframe-based Visual-Inertial SLAM Configurable with Dense Depth or LiDAR, and GNSS

Simon Boche, Jaehyung Jung, Sebastián Barbas Laina, Stefan Leutenegger

TL;DR

OKVIS2-X tackles the challenge of robust, accurate SLAM in large-scale environments by unifying visual-inertial sensing with dense volumetric occupancy mapping and optional depth or LiDAR inputs, all fused within a single factor-graph optimization. It introduces submap-based occupancy maps tightly integrated with the estimator, online camera-IMU extrinsics calibration, and GNSS fusion, enabling globally consistent maps and real-time operation on trajectories of up to $9\,\mathrm{km}$. The approach is validated on the EuRoC, Hilti-Oxford, and VBR datasets, showing state-of-the-art results in VI and VI-LiDAR configurations and demonstrating resilience to GNSS outages. The work advances practical autonomous navigation by delivering dense, usable maps with strong accuracy, scalability, and robustness, and it is released as open source for the community.

Abstract

To empower mobile robots with usable maps as well as the highest state estimation accuracy and robustness, we present OKVIS2-X: a state-of-the-art multi-sensor Simultaneous Localization and Mapping (SLAM) system that builds dense volumetric occupancy maps while scaling to large environments and operating in real time. Our unified SLAM framework seamlessly integrates different sensor modalities: visual, inertial, measured or learned depth, LiDAR, and Global Navigation Satellite System (GNSS) measurements. Unlike most state-of-the-art SLAM systems, we advocate using dense volumetric map representations when leveraging depth or range-sensing capabilities. We employ an efficient submapping strategy that allows our system to scale to large environments, showcased in sequences of up to 9 kilometers. OKVIS2-X enhances its accuracy and robustness by tightly coupling the estimator and submaps through map alignment factors. Our system provides globally consistent maps, directly usable for autonomous navigation. To further improve the accuracy of OKVIS2-X, we also incorporate the option of performing online calibration of camera extrinsics. Our system achieves the highest trajectory accuracy in EuRoC against state-of-the-art alternatives, outperforms all competitors in the Hilti22 VI-only benchmark while also proving competitive in the LiDAR version, and showcases state-of-the-art accuracy in the diverse and large-scale sequences from the VBR dataset.

Paper Structure

This paper contains 41 sections, 33 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: 3D reconstruction from a run of OKVIS2-X on the Spagna sequence of the VBR dataset [VBR], using a LiDAR sensor (top) or a depth network (bottom), to showcase the versatility of the presented system across sensor modalities. The estimated trajectory is shown in black, and each submap is drawn in a different color.
  • Figure 2: System architecture of our proposed multi-sensor state estimator OKVIS2-X. Components with a grey background are original elements of OKVIS2 [OKVIS2]; components with a white background are the extensions for the multi-sensor setup.
  • Figure 3: Predicted inverse depth (top) and its corresponding standard deviation (bottom) in the EuRoC dataset for (a) the stereo network with the $11\,\text{cm}$ baseline, (b) the MVS network with the $50\,\text{cm}$ maximum baseline among $8$ views, and (c) depth fusion. (Adopted from [jung2024uncertainty].)
  • Figure 4: Initially, a full batch VI factor graph (a) is created and optimized. Later (b), frames with the least overlap with the live frame and current keyframe are turned into pose-graph poses by constructing relative pose errors under marginalization of common observations; old poses and speed/bias variables are also fixed to keep the problem real-time capable. When a loop closure occurs (c), the respective observations and landmarks are re-activated. The proposed system furthermore supports online calibration of the IMU-camera extrinsics (d). ((a-c) Adopted from [OKVIS2].)
  • Figure 5: Factor graph including dense submap alignment. Left: The real-time estimator connects the set of current keyframe and non-keyframe states by IMU errors and visual reprojection errors. Frame-to-map factors are formulated between every live state in the optimization window and the keyframe state associated with the last completed submap. Right: Measurements between frames can be aggregated, and map-to-map factors can be added to the factor graph between submap keyframe states if the geometric overlap surpasses a threshold. (Adopted from [boche2024tightlycoupled].)
  • ...and 9 more figures
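
The Figure 3 caption pairs each predicted inverse-depth map with a per-pixel standard deviation and mentions a fused result. The captions do not say how the fusion is computed, so the following is only a minimal sketch of one standard option, inverse-variance weighting of two independent Gaussian estimates. It is not confirmed as the method used in OKVIS2-X. The function name `fuse_inverse_depth` and all numeric values are hypothetical and chosen for illustration.

```python
import numpy as np


def fuse_inverse_depth(rho_a, sigma_a, rho_b, sigma_b):
    """Fuse two per-pixel inverse-depth maps by inverse-variance weighting.

    ASSUMPTION: both estimates are independent and Gaussian per pixel.
    This is a generic textbook rule, not the paper's confirmed procedure.
    """
    w_a = 1.0 / np.square(sigma_a)  # weight = 1 / variance
    w_b = 1.0 / np.square(sigma_b)
    rho_fused = (w_a * rho_a + w_b * rho_b) / (w_a + w_b)
    sigma_fused = np.sqrt(1.0 / (w_a + w_b))
    return rho_fused, sigma_fused


# Hypothetical 1x2 "images": inverse depth in 1/m with standard deviations.
rho_stereo = np.array([[0.50, 0.20]])
sigma_stereo = np.array([[0.05, 0.10]])
rho_mvs = np.array([[0.54, 0.10]])
sigma_mvs = np.array([[0.05, 0.02]])

rho_fused, sigma_fused = fuse_inverse_depth(
    rho_stereo, sigma_stereo, rho_mvs, sigma_mvs
)
```

Under this rule, the fused standard deviation is never larger than the smaller of the two inputs, and a pixel where one source is much more certain (the second pixel here) is dominated by that source.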