LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, Xiao-Xiao Long

TL;DR

LiteVGGT addresses the scalability bottleneck of the Visual Geometry Grounded Transformer (VGGT) by introducing geometry-aware cached token merging, which reduces the number of tokens processed by global attention and caches merge decisions across layers. It leverages a geometry-aware map that combines edge information and token variance to prioritize important tokens (GA Tokens) and anchors (Dst), merging redundant Src tokens into their nearest Dst tokens and unmerging for dense outputs. The approach is complemented by fine-tuning and FP8 quantization to further improve efficiency with minimal loss in reconstruction and pose accuracy. Extensive experiments across indoor and outdoor datasets demonstrate up to 10x speedups, substantial memory savings, and robust performance in 3D reconstruction and camera pose estimation, enabling large-scale scene processing that VGGT could not handle before.
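
The merge and unmerge steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the per-token `scores` stand in for the geometry-aware map, and three choices are assumptions made for illustration: taking the highest-scoring tokens as Dst anchors, using cosine similarity as the notion of "nearest", and merging by averaging. The separately preserved GA Tokens are omitted, and every function name is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def compute_merge_indices(tokens, scores, num_dst):
    """Pick the num_dst highest-scoring tokens as Dst anchors (assumption)
    and assign every remaining Src token to its most similar Dst token."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    dst_idx = sorted(order[:num_dst])
    src_idx = sorted(order[num_dst:])
    assign = [
        max(range(len(dst_idx)), key=lambda j: cosine(tokens[s], tokens[dst_idx[j]]))
        for s in src_idx
    ]
    return dst_idx, src_idx, assign


def merge(tokens, dst_idx, src_idx, assign):
    """Average each Dst token with the Src tokens assigned to it (assumption)."""
    sums = [list(tokens[d]) for d in dst_idx]
    counts = [1] * len(dst_idx)
    for s, j in zip(src_idx, assign):
        for k, value in enumerate(tokens[s]):
            sums[j][k] += value
        counts[j] += 1
    return [[v / c for v in row] for row, c in zip(sums, counts)]


def unmerge(merged, dst_idx, src_idx, assign, num_tokens):
    """Restore the original token layout by copying each merged token back
    to its Dst position and to every Src position that was merged into it."""
    out = [None] * num_tokens
    for j, d in enumerate(dst_idx):
        out[d] = list(merged[j])
    for s, j in zip(src_idx, assign):
        out[s] = list(merged[j])
    return out


# Toy example: four 2-D tokens, two of which act as anchors.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.1, 0.8, 0.2]  # hypothetical geometry-aware scores
dst_idx, src_idx, assign = compute_merge_indices(tokens, scores, num_dst=2)
merged = merge(tokens, dst_idx, src_idx, assign)      # 2 tokens go through attention
restored = unmerge(merged, dst_idx, src_idx, assign, len(tokens))  # back to 4 tokens
```

In this toy run, global attention would see two tokens instead of four, and the unmerge step returns a sequence of the original length so that later per-frame layers and dense prediction heads can operate on the full layout.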

Abstract

3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive on long sequences, which limits its application to large-scale scenes beyond a few hundred images. To address this, we propose LiteVGGT, which achieves up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing merge decisions to be reused. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance and optimize anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
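
The second insight, that token similarity stays stable across adjacent layers, suggests computing merge indices only occasionally and reusing them in between. A minimal sketch of such a cache follows, assuming a fixed refresh interval. The class `MergeIndexCache`, the `refresh_every` parameter, and the toy index function are hypothetical, and the abstract does not state the actual refresh schedule.

```python
class MergeIndexCache:
    """Recompute merge indices every `refresh_every` layers and reuse the
    cached result for the layers in between (hypothetical schedule)."""

    def __init__(self, compute_fn, refresh_every):
        if refresh_every < 1:
            raise ValueError("refresh_every must be at least 1")
        self.compute_fn = compute_fn
        self.refresh_every = refresh_every
        self._cached = None
        self.num_computations = 0  # counts how often the costly step ran

    def get(self, layer, tokens):
        """Return merge indices for this layer, recomputing only on refresh layers."""
        if self._cached is None or layer % self.refresh_every == 0:
            self._cached = self.compute_fn(tokens)
            self.num_computations += 1
        return self._cached


def toy_indices(tokens):
    """Placeholder for the real index computation: map each token to the
    even-indexed token at or just before it."""
    return [i - (i % 2) for i in range(len(tokens))]


# Toy run: 12 layers, indices refreshed at layers 0, 4, and 8 only.
cache = MergeIndexCache(toy_indices, refresh_every=4)
for layer in range(12):
    indices = cache.get(layer, tokens=[0.0] * 6)
```

With this schedule the index computation runs three times instead of twelve. The trade-off is that indices from an earlier layer may be slightly stale, which the abstract reports as having minimal impact on accuracy.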

Paper Structure

This paper contains 25 sections, 2 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: For 1000 input images, LiteVGGT achieves a 10× speedup over VGGT while maintaining high accuracy in camera pose and point cloud prediction. Its scalability and robustness make large-scale scene reconstruction more efficient and reliable.
  • Figure 2: Architecture Overview. We augment VGGT by placing a Geometry-aware Token Merging module on both sides of its global attention. Within GA-merge, tokens are partitioned by the GA map, grouped and merged to reduce redundancy, and after global attention the merged tokens are unmerged back to the original layout and passed to the subsequent frame-attention layers.
  • Figure 3: Latency analysis of the VGGT components. As the number of images increases, Global Attention gradually dominates the inference time.
  • Figure 4: Latency breakdown after introducing token merging. CM denotes merge index computation latency, which becomes a bottleneck for long sequences and is addressed by our caching strategy.
  • Figure 5: Experiment on geometric cues. VGGT and DepthAnythingV2 [yang2024depth] still produce reasonable geometric results from an input edge map.
  • ...and 8 more figures