VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency
Zhuang Xiong, Chen Zhang, Qingshan Xu, Wenbing Tao
TL;DR
VGGT-Motion addresses scale drift and geometric fragmentation in calibration-free monocular SLAM on long sequences. It combines motion-aware submap construction, anchor-driven direct Sim(3) registration, and a lightweight submap-level pose graph. The method partitions each sequence by motion regime, obtains dense pixel-indexed correspondences from context-balanced anchors, and optimizes a sparse Sim(3) backend with linear complexity. On zero-shot, kilometer-scale benchmarks (e.g., 4Seasons, Complex Urban, A2D2), it reduces ATE and Drift by 85–95% relative to previous VGGT variants and runs 18–36× faster than state-of-the-art foundation-model-based SLAM. The work points toward motion-aware representations and scalable, calibration-free SLAM for unconstrained outdoor deployment, and leaves room for integrating alternative foundation models and renderable scene representations.
Abstract
Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
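To make the "search-free, pixel-wise dense alignment" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that two submaps each predict a 3D point map for the same shared anchor frame, so that pixel (u, v) in one map corresponds to pixel (u, v) in the other without any feature matching. The Sim(3) estimator used here is the standard closed-form Umeyama solution, chosen as a stand-in because the abstract does not state which estimator the method uses. The function name `umeyama_sim3` and the synthetic data are hypothetical.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares Sim(3) fit (Umeyama, 1991).

    Returns (s, R, t) minimizing sum_i ||dst_i - (s * R @ src_i + t)||^2
    for two (N, 3) arrays whose rows are already in correspondence.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # 3x3 cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = float(np.trace(np.diag(D) @ S) / var_src)
    t = mu_dst - s * (R @ mu_src)
    return s, R, t


# Synthetic anchor frame: an (H, W, 3) point map as seen from submap A.
rng = np.random.default_rng(0)
H, W = 24, 32
points_a = rng.normal(size=(H, W, 3))

# Assumed ground-truth Sim(3) taking submap A's frame to submap B's frame.
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
s_true = 2.5
t_true = np.array([1.0, -2.0, 0.5])

# The same anchor frame as predicted inside submap B (same pixel grid).
points_b = s_true * points_a @ R_true.T + t_true

# Pixel-indexed correspondences: flattening both maps pairs pixel (u, v)
# in A with pixel (u, v) in B, so no search or descriptor matching is needed.
s_est, R_est, t_est = umeyama_sim3(points_a.reshape(-1, 3),
                                   points_b.reshape(-1, 3))
```

In practice the predicted point maps are noisy, so a real system would likely weight or mask pixels by confidence before fitting; that step is omitted here. The recovered (s, R, t) for each pair of overlapping submaps would then serve as an edge in the submap-level Sim(3) pose graph.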
