VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie

TL;DR

VGGT-Long tackles the memory and scalability barriers of foundation-model-based monocular 3D reconstruction for long outdoor sequences. It introduces a minimalist, chunk-based pipeline with overlapping local alignment, confidence-aware weighting, loop closure via loop-centric chunks driven by visual place recognition (VPR), and a global Sim(3) Levenberg–Marquardt optimization to maintain global consistency. Evaluated on KITTI, Waymo, and Virtual KITTI, it achieves kilometer-scale reconstructions without camera calibration or depth supervision and remains robust where prior foundation-model approaches drift or fail due to memory limits. The method shows that a strong base model, paired with a simple chunk-and-align strategy, can scale to real-world, long-range 3D perception tasks.

Abstract

Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision, or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on the KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
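
To make the chunk-based processing strategy concrete, here is a minimal sketch of how a long frame sequence could be split into overlapping windows. The function name `make_chunks` and the chunk size and overlap values are illustrative assumptions, not taken from the paper or its repository.

```python
def make_chunks(num_frames, chunk_size, overlap):
    """Split frame indices [0, num_frames) into overlapping windows.

    Returns a list of (start, end) pairs with `end` exclusive.
    Consecutive windows share `overlap` frames, which is what a later
    alignment step would use to register one chunk to the next.
    NOTE: hypothetical helper, not the authors' implementation.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if num_frames <= 0:
        return []
    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:  # the last window reached the sequence end
            break
        start += stride
    return chunks


print(make_chunks(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

Only one chunk is passed to the base model at a time, so peak memory depends on the chunk size rather than on the total sequence length. The final window may be shorter than `chunk_size` in this sketch.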

Paper Structure

This paper contains 13 sections, 5 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: For large-scale outdoor scenarios, previous methods either suffer from severe drift (CUT3R and Fast3R) or fail to complete the entire long sequence (MASt3R-SLAM and VGGT). Our method, VGGT-Long, completes the reconstruction of the kilometer-scale scene while maintaining its accuracy.
  • Figure 2: Overview of VGGT-Long. VGGT-Long processes long sequences by dividing them into different chunks, thereby handling the input RGB stream in a sliding window manner. We fully utilize VGGT's pointmap and confidence to perform lightweight loop closure and alignment on the output chunks, thus extending VGGT to long-sequence datasets for autonomous driving.
  • Figure 3: (a) VGGT-Long divides a kilometer-scale sequence into different chunks for processing. (b) The alignments are derived from the consistency of overlapping frames in 3D space.
  • Figure 4: Confidence-aware alignment suppresses the influence of high-speed dynamic objects (such as vehicles) on alignment and reconstruction. It can be observed that vehicles in dense traffic cannot be effectively filtered out by the LiDAR, whereas VGGT-Long handles this situation.
  • Figure 5: Without loop constraints, errors accumulate continuously at the kilometer scale. Global LM optimization alleviates this accumulated error.
  • ...and 7 more figures
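
The captions for Figures 3 and 4 describe aligning neighboring chunks from their overlapping frames, with confidence used to down-weight unreliable points. The following is a minimal sketch of one way to do this: a closed-form, weighted Umeyama-style Sim(3) fit with NumPy. It assumes point correspondences come from the same pixels of the shared frames and that the weights are derived from per-point confidence. The function name `weighted_sim3` is hypothetical, and the paper's actual solver may differ (for example, it may be iterative or use a robust loss).

```python
import numpy as np


def weighted_sim3(src, dst, weights):
    """Estimate scale s, rotation R, translation t minimizing
    sum_i w_i * || s * R @ src_i + t - dst_i ||^2  (weighted Umeyama).

    src, dst: (N, 3) corresponding 3D points from two overlapping chunks.
    weights:  (N,) non-negative weights, e.g. derived from confidence.
    NOTE: illustrative sketch, not the authors' implementation.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1

    # Weighted centroids and centered point sets.
    mu_src = w @ src
    mu_dst = w @ dst
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Weighted 3x3 cross-covariance between target and source.
    cov = (dst_c * w[:, None]).T @ src_c
    U, S, Vt = np.linalg.svd(cov)

    # Guard against a reflection so that R is a proper rotation.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    # Scale from the ratio of covariance energy to source variance.
    var_src = np.sum(w * np.sum(src_c ** 2, axis=1))
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Points with low weight, such as those on fast-moving vehicles, contribute little to the estimate, which is the intuition behind the confidence-aware alignment in Figure 4. A pairwise fit like this cannot remove drift accumulated over many chunks, which is why the method adds loop constraints and a global optimization (Figure 5).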