
AIM-SLAM: Dense Monocular SLAM via Adaptive and Informative Multi-View Keyframe Prioritization with Foundation Model

Jinwoo Jeon, Dong-Uk Seo, Eungchang Mason Lee, Hyun Myung

TL;DR

A dense monocular SLAM framework that exploits an adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from visual geometry grounded transformer (VGGT) and forms a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy.

Abstract

Recent advances in geometric foundation models have emerged as a promising alternative for addressing the challenge of dense reconstruction in monocular visual simultaneous localization and mapping (SLAM). Although geometric foundation models enable SLAM to leverage variable input views, previous methods remain confined to two-view pairs or fixed-length inputs and give insufficient consideration to geometric context when selecting views. To tackle this problem, we propose AIM-SLAM, a dense monocular SLAM framework that exploits an adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from visual geometry grounded transformer (VGGT). Specifically, we introduce the selective information- and geometric-aware multi-view adaptation (SIGMA) module, which employs voxel overlap and information gain to retrieve a candidate set of keyframes and adaptively determine its size. Furthermore, we formulate a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy. The effectiveness of AIM-SLAM is demonstrated on real-world datasets, where it achieves state-of-the-art performance in both pose estimation and dense reconstruction. Our system supports ROS integration, and the code is available at https://aimslam.github.io/.
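
For readers unfamiliar with the notation, a Sim(3) element can be written as below. This is the standard textbook definition of a 3D similarity transform, not the paper's optimization objective, which is not reproduced on this page; the symbols $s$, $\mathbf{R}$, $\mathbf{t}$, and $\mathbf{S}$ are introduced here only for illustration.

$$
\mathbf{S} = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} \in \mathrm{Sim}(3), \qquad s > 0,\ \mathbf{R} \in \mathrm{SO}(3),\ \mathbf{t} \in \mathbb{R}^3, \qquad \mathbf{S} \cdot \mathbf{x} = s\mathbf{R}\mathbf{x} + \mathbf{t}.
$$

Compared with a rigid SE(3) pose, the additional scale factor $s$ gives seven degrees of freedom, which is what allows a monocular system to absorb scale differences between views.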

Paper Structure (24 sections, 8 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Comparison among MASt3R-SLAM, VGGT-SLAM, and the proposed AIM-SLAM. MASt3R-SLAM relies on a fixed two-view input, while VGGT-SLAM processes a fixed chunk of consecutive keyframes. In contrast, AIM-SLAM adaptively prioritizes a variable number of keyframes with high viewpoint overlap and information gain. By jointly optimizing these multi-view inputs in $\mathrm{Sim}(3)$ space, AIM-SLAM achieves accurate and globally consistent dense reconstruction.
  • Figure 2: Overall architecture of AIM-SLAM. The frontend consists of (a) multi-view prioritization method via the proposed SIGMA module, followed by VGGT-based dense pointmap inference, and (b) joint multi-view $\mathrm{Sim}(3)$ optimization to mitigate short- and mid-term drift. The backend loop closure module performs global pose-graph optimization to ensure global consistency.
  • Figure 3: Block diagram of the proposed SIGMA module, which consists of three stages: (i) geometry-based subset initialization via voxel overlap, (ii) information-driven re-ranking based on covariance reduction, and (iii) adaptive activation regulated by a stability test. After each multi-view optimization, updated poses and confidences recurrently trigger the re-ranking process.
  • Figure 4: Example of a voxel-indexed keyframe map for computing view overlap. Each voxel stores the IDs of keyframes that observe it. Using the last keyframe as the query, the system counts shared voxels and selects the top-$N$ overlapping keyframes to initialize the multi-view subset.
  • Figure 5: Effect of the SIGMA module on keyframe uncertainty reduction. Compared with the case without re-ranking (lower), incorporating information-driven re-ranking (upper) significantly decreases keyframe uncertainty, computed as the inverse of the fused point confidence aggregated across observations during optimization. This shows that the SIGMA module effectively retrieves informative frames to refine the keyframe. Uncertainty is visualized in color, with higher values shown in warmer colors. Regions with pronounced differences are highlighted with red rectangles.
  • ...and 2 more figures
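
The voxel-indexed keyframe map described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `VoxelKeyframeMap`, its methods, the `voxel_size` parameter, and the toy point coordinates are all hypothetical, and the later re-ranking and stability-test stages of SIGMA are not modeled.

```python
import math
from collections import Counter, defaultdict


def voxel_keys(points, voxel_size):
    """Quantize an iterable of (x, y, z) points into a set of integer voxel indices."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


class VoxelKeyframeMap:
    """Hypothetical voxel-indexed map: each voxel stores the IDs of keyframes observing it."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size            # assumed edge length of a voxel
        self.voxel_to_kfs = defaultdict(set)    # voxel index -> keyframe IDs
        self.kf_to_voxels = {}                  # keyframe ID -> voxel indices

    def add_keyframe(self, kf_id, points):
        """Register a keyframe by the voxels its 3D points fall into."""
        keys = voxel_keys(points, self.voxel_size)
        self.kf_to_voxels[kf_id] = keys
        for key in keys:
            self.voxel_to_kfs[key].add(kf_id)

    def top_n_overlapping(self, query_id, n):
        """Return up to n (keyframe ID, shared-voxel count) pairs, largest overlap first."""
        counts = Counter()
        for key in self.kf_to_voxels[query_id]:
            for other in self.voxel_to_kfs[key]:
                if other != query_id:
                    counts[other] += 1
        # Sort by shared-voxel count (descending), then by ID for determinism.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Toy example: keyframe 3 is the last keyframe and acts as the query.
vmap = VoxelKeyframeMap(voxel_size=1.0)
vmap.add_keyframe(0, [(0.1, 0.1, 0.1), (1.1, 0.1, 0.1), (2.1, 0.1, 0.1)])
vmap.add_keyframe(1, [(1.2, 0.2, 0.2), (2.2, 0.2, 0.2), (3.2, 0.2, 0.2)])
vmap.add_keyframe(2, [(5.0, 5.0, 5.0)])
vmap.add_keyframe(3, [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)])
top = vmap.top_n_overlapping(query_id=3, n=2)
print(top)  # [(0, 3), (1, 2)]
```

Storing keyframe IDs per voxel means the overlap query only touches voxels seen by the query keyframe, rather than comparing it against every stored keyframe; keyframes with no shared voxels (keyframe 2 here) never enter the candidate list.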