Faster VGGT with Block-Sparse Global Attention
Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe
TL;DR
Problem: transformer-based multi-view geometry methods like VGGT and π^3 suffer from quadratic global-attention costs that hinder scalability. Approach: perform a data-driven analysis showing attention sparsity and mid-layer dominance, then introduce a training-free block-sparse global attention that uses pooled Q/K to predict active blocks, keeping special-token interactions dense. Contributions: (1) sparsity analysis linking cross-view correspondences to mid-stack patch-patch attention, (2) a practical block-sparse retrofit achieving up to 4x inference speedups with comparable accuracy on VGGT and π^3, (3) extensive benchmarks across Real Estate 10K, CO3Dv2, Tanks & Temples, ETH3D, Seven Scenes, NRGBD, DTU, and ScanNet. Significance: enables scalable, efficient multi-view reconstruction without backbone retraining and is deployable on large image collections while preserving task performance.
Abstract
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $π^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck: the quadratic complexity of the global attention layers limits scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structured attention and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $π^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
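
To make the idea of predicting active blocks from pooled Q/K concrete, the following NumPy sketch illustrates one plausible form of it. It is not the authors' implementation: the function names, the average pooling, the per-row top-k selection rule, the `keep_ratio` parameter, and the assumption that special tokens occupy the first `num_special` positions of the sequence are all hypothetical choices made for illustration. The attention function is a dense reference that only emulates the block mask; actual speedups would come from a block-sparse GPU kernel that skips the inactive blocks.

```python
import numpy as np

def predict_block_mask(q, k, block_size=4, keep_ratio=0.25, num_special=0):
    """Estimate which (query block, key block) pairs to keep.

    q, k: arrays of shape (n_tokens, dim); n_tokens must be divisible
    by block_size. Returns a boolean mask of shape (n_blocks, n_blocks).
    Assumption: special tokens, if any, sit at the start of the sequence.
    """
    n, d = q.shape
    assert n % block_size == 0, "n_tokens must be a multiple of block_size"
    nb = n // block_size

    # Average-pool tokens inside each block to get one vector per block.
    q_pool = q.reshape(nb, block_size, d).mean(axis=1)
    k_pool = k.reshape(nb, block_size, d).mean(axis=1)

    # Low-resolution attention logits between pooled blocks.
    scores = q_pool @ k_pool.T / np.sqrt(d)

    # Keep the highest-scoring key blocks for every query block.
    k_keep = max(1, int(round(keep_ratio * nb)))
    top = np.argpartition(-scores, k_keep - 1, axis=1)[:, :k_keep]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)

    # Keep all interactions involving special-token blocks dense.
    special_blocks = -(-num_special // block_size)  # ceiling division
    mask[:special_blocks, :] = True
    mask[:, :special_blocks] = True
    return mask

def block_sparse_attention(q, k, v, mask, block_size):
    """Dense reference that applies the block mask before the softmax."""
    d = q.shape[1]
    # Expand the block mask to token resolution.
    token_mask = mask.repeat(block_size, axis=0).repeat(block_size, axis=1)
    logits = q @ k.T / np.sqrt(d)
    logits = np.where(token_mask, logits, -np.inf)
    # Numerically stable softmax; every row has at least one kept block.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

With `keep_ratio=1.0` the mask is all ones and the function reduces to ordinary dense attention, which gives a simple sanity check. Lower ratios trade accuracy for fewer active blocks; how the kept fraction is chosen in practice is a design decision this sketch does not attempt to reproduce.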
