SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes

Jungho Lee, Minhyeok Lee, Sunghun Yang, Minseok Kang, Sangyoun Lee

TL;DR

SwiftVGGT tackles the speed-accuracy trade-off in kilometer-scale dense 3D reconstruction by offering a training-free pipeline. It introduces a reliability-guided point sampling scheme to enable a single-step $Sim(3)$ alignment via Umeyama and a VPR-free loop-closure mechanism that leverages VGGT's DINO tokens, followed by a global $Sim(3)$ optimization with Levenberg–Marquardt. The approach delivers state-of-the-art reconstruction quality while achieving over a 3x reduction in inference time without extra memory overhead, validated on KITTI, Waymo, and Virtual KITTI. A primary limitation is the lack of bundle adjustment, suggesting future integration of optimization-based pose refinement for further drift correction.
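To make the idea of loop detection without a separate VPR model more concrete, here is a minimal sketch of one generic way per-frame descriptors could be built from encoder patch tokens and compared. Everything in it is an assumption for illustration, not the paper's procedure: the mean pooling, the cosine-similarity test, the function names `frame_descriptors` and `loop_candidates`, and the parameters `min_gap` and `threshold` are all hypothetical.

```python
import numpy as np


def frame_descriptors(patch_tokens):
    """Mean-pool (num_frames, num_patches, dim) tokens into unit-norm frame descriptors.

    Mean pooling is an illustrative assumption, not taken from the paper.
    """
    desc = np.asarray(patch_tokens, dtype=np.float64).mean(axis=1)
    norms = np.linalg.norm(desc, axis=1, keepdims=True)
    return desc / np.clip(norms, 1e-12, None)


def loop_candidates(patch_tokens, min_gap=50, threshold=0.9):
    """Return (i, j, similarity) for frame pairs that look like revisits.

    A pair qualifies when the frames are at least `min_gap` indices apart
    (hypothetical parameter, to skip trivially similar neighbors) and their
    cosine similarity exceeds `threshold` (also hypothetical).
    """
    desc = frame_descriptors(patch_tokens)
    sim = desc @ desc.T  # cosine similarity, since descriptors are unit-norm
    n = sim.shape[0]
    # Upper-triangle index pairs with j - i >= min_gap.
    i_idx, j_idx = np.triu_indices(n, k=min_gap)
    keep = sim[i_idx, j_idx] > threshold
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in zip(i_idx[keep], j_idx[keep])]
```

The point of the sketch is only that tokens already computed by the reconstruction backbone can double as place descriptors, so no second network has to run over the images. Candidate pairs found this way would then feed the global $Sim(3)$ optimization mentioned above.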

Abstract

3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference times. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method to align neighboring chunks using a single Sim(3)-based Singular Value Decomposition (SVD) step. This eliminates the need for the Iteratively Reweighted Least Squares (IRLS) optimization commonly used in prior work, leading to substantial speed-ups. We evaluate SwiftVGGT on multiple datasets and show that it achieves state-of-the-art reconstruction quality while requiring only 33% of the inference time of recent VGGT-based large-scale reconstruction approaches.
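The single SVD-based Sim(3) step can be illustrated with the standard closed-form Umeyama solution, sketched below in NumPy. This is the textbook algorithm, not code from the paper: the function name `umeyama_sim3` is hypothetical, and the sketch assumes the corresponding points in the overlap between two chunks have already been selected (the paper's reliability-guided sampling is not reproduced here).

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares similarity fit: dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns scale s (float), rotation R (3x3), translation t (3,).
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = src.shape[0]

    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard so that det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n       # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Because the solution is closed-form, it costs one 3x3 SVD regardless of how many points are used, whereas IRLS repeats a weighted fit over several iterations. The trade-off is that a plain least-squares fit has no built-in outlier rejection, which is why the quality of the sampled points matters.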

Paper Structure

This paper contains 38 sections, 9 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: SwiftVGGT processes thousands of input images by dividing them into sliding-window chunks through VGGT [vggt]. To reduce inference time, we eliminate the IRLS optimization step by applying reliability-guided point sampling. Furthermore, we utilize the patch tokens obtained from the VGGT encoder [dinov2] for loop detection directly, which further decreases the overall inference cost.
  • Figure 2: Comparison of dense 3D reconstruction on KITTI scenes containing loops. The pseudo ground-truth point cloud is obtained by combining the LiDAR point cloud with camera poses either provided by the dataset or estimated by PIN-SLAM [pinslam]. The gray dashed line represents the ground-truth trajectory, while the green solid line denotes the estimated trajectory.
  • Figure 3: Failure cases of KITTI scenes. The pseudo ground-truth point cloud is obtained by combining the LiDAR point cloud with camera poses either provided by the dataset or estimated by PIN-SLAM [pinslam]. The gray dashed line represents the ground-truth trajectory, while the green solid line denotes the estimated trajectory.
  • Figure 4: Point cloud visualization of KITTI dataset [kitti].
  • Figure 5: Point cloud visualization of KITTI dataset [kitti].
  • ...and 2 more figures
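
The sliding-window chunking described in the Figure 1 caption can be sketched as a small index-splitting helper. This is a minimal illustration under assumptions: the function name `sliding_window_chunks` and the parameters `chunk_size` and `overlap` are hypothetical, and the values the paper actually uses are not stated in this summary.

```python
def sliding_window_chunks(num_frames, chunk_size, overlap):
    """Split frame indices 0..num_frames-1 into overlapping windows.

    Returns a list of (start, end) ranges with `end` exclusive. Consecutive
    ranges share `overlap` frames, which is where neighboring chunks can be
    aligned to each other. The last range may be shorter than `chunk_size`.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break  # the sequence is fully covered
        start += stride
    return chunks
```

For example, `sliding_window_chunks(10, 4, 2)` yields four windows of four frames, each sharing two frames with the next. Each window would be processed independently by the backbone, and the shared frames supply the point correspondences for chunk-to-chunk alignment.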