MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM

Yuxuan Zhou, Xingxing Li, Shengyu Li, Zhuohao Yan, Chunxi Xia, Shaoquan Feng

TL;DR

MASt3R-Fusion tackles scale ambiguity and robustness in visual SLAM by tightly fusing feed-forward pointmap regression with IMU and GNSS data. It introduces Sim(3)-based visual alignment constraints embedded in a metric SE(3) factor graph, bridged via an isomorphic transformation to enable joint optimization and global consistency. The system features real-time sliding-window VIO with dense 3D priors and a global optimization stage that leverages loop closures and GNSS for drift-free mapping, demonstrated on KITTI-360, SubT-MRS, and Wuhan urban datasets. Results show substantial accuracy gains and robustness over state-of-the-art methods, with open-source code to support reproducibility and further research.

Abstract

Visual SLAM is a cornerstone technique in robotics, autonomous driving and extended reality (XR), yet classical systems often struggle with low-texture environments, scale ambiguity, and degraded performance under challenging visual conditions. Recent advancements in feed-forward neural network-based pointmap regression have demonstrated the potential to recover high-fidelity 3D scene geometry directly from images, leveraging learned spatial priors to overcome limitations of traditional multi-view geometry methods. However, the widely validated advantages of probabilistic multi-sensor information fusion are often discarded in these pipelines. In this work, we propose MASt3R-Fusion, a multi-sensor-assisted visual SLAM framework that tightly integrates feed-forward pointmap regression with complementary sensor information, including inertial measurements and GNSS data. The system introduces Sim(3)-based visual alignment constraints (in the Hessian form) into a universal metric-scale SE(3) factor graph for effective information fusion. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures, enabling real-time pose tracking, metric-scale structure perception and globally consistent mapping. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems. The code will be released open-source to support reproducibility and further research (https://github.com/GREAT-WHU/MASt3R-Fusion).
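
To make the Sim(3) versus SE(3) distinction in the abstract concrete, the following is a minimal sketch of the standard similarity-transform algebra, assuming the usual convention that a Sim(3) element (s, R, t) maps a point p to s·R·p + t. It shows that such an element can be read as an SE(3) pose (R, t) plus a separate log-scale variable. All function names here are hypothetical and chosen for illustration; this is not the paper's isomorphic transformation, whose exact form is not given in this section.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z-axis (used only to build test poses)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(R, p):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][k] * p[k] for k in range(3)) for i in range(3)]

def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def sim3_apply(T, p):
    """Apply a similarity transform T = (s, R, t): p -> s * R p + t."""
    s, R, t = T
    Rp = mat_vec(R, p)
    return [s * Rp[i] + t[i] for i in range(3)]

def sim3_compose(A, B):
    """Return A o B (apply B first, then A)."""
    sa, Ra, ta = A
    sb, Rb, tb = B
    Ra_tb = mat_vec(Ra, tb)
    t = [sa * Ra_tb[i] + ta[i] for i in range(3)]
    return (sa * sb, mat_mul(Ra, Rb), t)

def split_pose_and_scale(T):
    """Read a Sim(3) element as an SE(3) pose (R, t) plus a log-scale.

    Illustrative assumption: the scale is parameterized by its logarithm,
    which keeps s = exp(log_s) positive during optimization.
    """
    s, R, t = T
    return (R, t), math.log(s)

# Hypothetical example values.
T_a = (2.0, rot_z(math.pi / 2), [1.0, 0.0, 0.0])
T_b = (0.5, rot_z(-math.pi / 4), [0.0, 3.0, 1.0])
point = [1.0, 0.0, 0.0]

mapped = sim3_apply(T_a, point)
pose, log_scale = split_pose_and_scale(T_a)
```

With a scale of 1 (log-scale 0) the transform reduces to an ordinary rigid-body SE(3) motion, which is why metric sensors such as IMU and GNSS can constrain the pose part directly while the scale is handled as an extra variable.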

Paper Structure

This paper contains 20 sections, 46 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Illustration of the MASt3R-Fusion system. Building upon the 3D perception and data association provided by the feed-forward model, this system tightly integrates multi-sensor information (e.g., IMU, GNSS), achieving globally consistent pose estimation and mapping. The heatmap indicates the inter-frame angle of cross-temporal data association. The system is capable of handling arbitrarily long sequences with 8 GB of GPU memory.
  • Figure 2: Flowchart of the system, consisting of real-time SLAM and global optimization stages.
  • Figure 3: Illustration of the two-view feed-forward model. Two images are encoded into feature tokens, which are then jointly decoded into two 2D-to-3D pointmaps (with $i$ as the common reference frame), together with two descriptor maps.
  • Figure 4: Illustration of the matching process and the construction of visual constraints. Note that the matching process is based on the pointmaps from a temporary two-view feed-forward pass, whereas the projection residuals are constructed using the pointmaps maintained by the system.
  • Figure 5: Illustration of different forms of visual constraints. Left: bundle adjustment, in which landmarks and camera poses are jointly optimized with little prior knowledge. Right: pointmap alignment, in which camera poses and pointmap scales are optimized, built upon the knowledge of the scaleless 3D structure.
  • ...and 12 more figures