VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency

Zhuang Xiong, Chen Zhang, Qingshan Xu, Wenbing Tao

TL;DR

VGGT-Motion addresses scale drift and geometric fragmentation in calibration-free monocular SLAM on long sequences by combining motion-aware submap construction with anchor-driven direct Sim(3) registration and a lightweight submap-level pose graph. The method partitions sequences into motion regimes, provides dense pixel-indexed correspondences via context-balanced anchors, and optimizes a sparse Sim(3) backend, achieving linear complexity. It delivers substantial improvements on zero-shot, kilometer-scale benchmarks (e.g., 4Seasons, Complex Urban, A2D2) with 85–95% reductions in ATE and Drift relative to previous VGGT variants, while attaining 18–36× speedups over state-of-the-art foundation-model-based SLAM. This work demonstrates a shift toward motion-aware representations and scalable, calibration-free SLAM suitable for unconstrained outdoor deployment, with potential integration of alternative foundation models and renderable scene representations.

Abstract

Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
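The "search-free, pixel-wise dense alignment" idea can be illustrated with a closed-form Sim(3) fit. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes an anchor frame shared by two submaps yields two point maps of identical shape, so pixel (u, v) in one map corresponds to pixel (u, v) in the other and no feature matching is needed. The alignment itself uses the standard Umeyama least-squares solution. The function names `umeyama_sim3` and `align_anchor_pointmaps`, and the optional confidence weighting, are hypothetical.

```python
import numpy as np


def umeyama_sim3(src, dst, weights=None):
    """Closed-form least-squares fit of dst ~ s * R @ src + t (Umeyama, 1991).

    src, dst: (N, 3) arrays of corresponding 3D points.
    weights:  optional (N,) non-negative weights (assumed here to be confidences).
    Returns scale s (float), rotation R (3x3), translation t (3,).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, float)
    w = w / w.sum()

    # Weighted centroids and centered point sets.
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    xs, xd = src - mu_s, dst - mu_d

    # Cross-covariance between target and source, then its SVD.
    cov = (w[:, None] * xd).T @ xs
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_s = (w * (xs ** 2).sum(axis=1)).sum()
    s = float((D * np.diag(S)).sum() / var_s)
    t = mu_d - s * (R @ mu_s)
    return s, R, t


def align_anchor_pointmaps(pm_src, pm_dst, conf=None):
    """Align two (H, W, 3) point maps of the same anchor frame.

    Correspondences are given by pixel index, so the maps are simply
    flattened in the same order; non-finite points are dropped.
    """
    p = np.asarray(pm_src, dtype=float).reshape(-1, 3)
    q = np.asarray(pm_dst, dtype=float).reshape(-1, 3)
    valid = np.isfinite(p).all(axis=1) & np.isfinite(q).all(axis=1)
    w = None if conf is None else np.asarray(conf, float).reshape(-1)[valid]
    return umeyama_sim3(p[valid], q[valid], w)
```

Because the correspondences come from pixel indices, the cost is a single pass over the anchor's pixels plus one 3×3 SVD. A practical system would likely add outlier handling (for example, robust reweighting), which is omitted here.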

Paper Structure (29 sections, 13 equations, 21 figures, 8 tables, 1 algorithm)

Figures (21)

  • Figure 1: Comparison of the SOTA method VGGT-Long [deng2025vggt] and our VGGT-Motion across sequences with varying frame counts. While VGGT-Long suffers from significant drift as the number of frames increases, our VGGT-Motion maintains high accuracy and fast inference across all sequence lengths.
  • Figure 2: The pipeline of VGGT-Motion, consisting of three stages: (a) Motion-Aware Submap Construction, (b) Anchor-Driven Direct $Sim(3)$ Registration, and (c) Lightweight Pose Graph Optimization. During submap construction, estimated motion states are used to adaptively partition the sequence into base segments augmented with geometric anchors. For submap alignment, VGGT infers local dense geometry, and an anchor-driven strategy directly estimates $Sim(3)$ constraints. Finally, pose graph optimization recovers a globally consistent trajectory and map.
  • Figure 3: Context-balanced anchors. (i) Globally consistent reconstruction. (ii) Overlap and loop anchors for submap alignment.
  • Figure 4: Qualitative results on generalization benchmarks. We visualize the estimated trajectories on (a) 4Seasons, (b) Complex Urban, and (c) A2D2 datasets. Our method exhibits long-range consistency across various challenging scenarios, whereas baseline methods fail.
  • Figure 5: Runtime efficiency comparison on five different benchmarks. (a) Total runtime per scene (s) shown in log scale. (b) Average runtime per frame (s/frame).
  • ...and 16 more figures
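
The motion-aware submap construction described in the Figure 2 caption can be sketched as a simple partitioning rule. This is a rough illustration under assumptions, not the paper's algorithm: it assumes a per-frame mean optical-flow magnitude is already available, drops near-static frames, and closes a submap once the accumulated flow reaches a budget. The function name `partition_by_motion`, all thresholds, and the one-frame overlap used as a shared anchor are hypothetical. Turn handling and loop anchors are not modeled.

```python
from typing import List, Sequence


def partition_by_motion(
    flow_mag: Sequence[float],
    static_thresh: float = 0.5,   # hypothetical: below this a frame counts as static
    motion_budget: float = 100.0,  # hypothetical: accumulated flow per submap
    max_frames: int = 32,          # hypothetical cap on submap length
    overlap: int = 1,              # frames shared with the next submap (anchors)
) -> List[List[int]]:
    """Split frame indices into overlapping submaps driven by motion.

    flow_mag[i] is the mean optical-flow magnitude between frame i-1 and i
    (flow_mag[0] is ignored for pruning; frame 0 is always kept).
    """
    # 1) Prune static redundancy: keep frame 0 and frames with enough motion.
    kept = [i for i, m in enumerate(flow_mag) if i == 0 or m >= static_thresh]

    submaps: List[List[int]] = []
    current: List[int] = []
    accumulated = 0.0
    for idx in kept:
        current.append(idx)
        accumulated += flow_mag[idx]
        # 2) Close the submap once enough motion (or enough frames) is covered.
        if accumulated >= motion_budget or len(current) >= max_frames:
            submaps.append(current)
            # 3) Carry the last frames over as overlap anchors.
            current = current[-overlap:] if overlap > 0 else []
            accumulated = 0.0

    # Keep a trailing submap only if it holds frames beyond the carried overlap.
    if current and (len(current) > overlap or not submaps):
        submaps.append(current)
    return submaps
```

With this rule, slow or stationary stretches contribute few frames, while fast motion closes submaps sooner. The shared overlap frames are what a registration step such as the anchor-driven Sim(3) alignment would operate on.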