Faster VGGT with Block-Sparse Global Attention

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe

TL;DR

Problem: transformer-based multi-view geometry methods like VGGT and $π^3$ suffer from quadratic global-attention costs that hinder scalability. Approach: perform a data-driven analysis showing attention sparsity and mid-layer dominance, then introduce a training-free block-sparse global attention that uses pooled Q/K to predict active blocks while keeping special-token interactions dense. Contributions: (1) a sparsity analysis linking cross-view correspondences to mid-stack patch-patch attention, (2) a practical block-sparse retrofit achieving up to $4\times$ faster inference with comparable accuracy on VGGT and $π^3$, (3) extensive benchmarks across Real Estate 10K, CO3Dv2, Tanks & Temples, ETH3D, Seven Scenes, NRGBD, DTU, and ScanNet. Significance: enables scalable, efficient multi-view reconstruction on large image collections without backbone retraining, while preserving task performance.

Abstract

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $π^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck: the quadratic complexity of the global attention layers limits scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structured attention and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $π^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
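The idea of predicting active attention blocks from pooled queries and keys can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the fixed top-`keep` selection rule, and the layout that packs special tokens into the first `num_special_blocks` blocks are all assumptions made for illustration. The page only states that pooled Q/K predict which blocks are active and that special-token interactions stay dense.

```python
import math

def mean_pool(rows, block):
    """Average each run of `block` consecutive token vectors into one vector."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_block_mask(q, k, block, keep, num_special_blocks=0):
    """Return mask[i][j] == True if query block i should attend to key block j.

    Hypothetical selection rule: keep the `keep` highest-scoring key blocks
    per query block, scored on mean-pooled Q and K.
    """
    qp, kp = mean_pool(q, block), mean_pool(k, block)
    scale = 1.0 / math.sqrt(len(q[0]))
    mask = []
    for i, qv in enumerate(qp):
        # Low-resolution attention: one score per (query block, key block) pair.
        scores = softmax([scale * sum(a * b for a, b in zip(qv, kv)) for kv in kp])
        top = sorted(range(len(kp)), key=lambda j: scores[j], reverse=True)[:keep]
        row = [j in top for j in range(len(kp))]
        # Assumed layout: special tokens occupy the first blocks and stay dense.
        for j in range(len(kp)):
            if i < num_special_blocks or j < num_special_blocks:
                row[j] = True
        mask.append(row)
    return mask

# Toy input: 6 tokens in 3 blocks of 2, each block aligned with one key block.
q = [[1.0, 0.0, 0.0]] * 2 + [[0.0, 1.0, 0.0]] * 2 + [[0.0, 0.0, 1.0]] * 2
k = [[4.0 * x for x in v] for v in q]
mask = predict_block_mask(q, k, block=2, keep=1, num_special_blocks=1)
```

A block-sparse kernel would then compute attention only for the `True` entries of `mask`. Because the mask is derived from the existing queries and keys at inference time, no backbone weights change, which is consistent with the training-free claim above.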

Paper Structure

This paper contains 26 sections, 2 equations, 15 figures, 14 tables.

Figures (15)

  • Figure 1: Runtime of VGGT's forward pass. FA denotes frame-wise attention. As the number of input frames increases, global attention dominates the computational cost (measured with FlashAttention-2 (Dao, 2023) on an H100 GPU at resolution $518^2$). We propose to adapt a block-sparse attention method that considerably reduces the cost of global attention while preserving result quality.
  • Figure 2: Architecture overview of VGGT (Wang et al., 2025). The key component is the Aggregator consisting of $L = 24$ alternating attention blocks (first frame-wise attention, then global attention over all frames). Each input frame is augmented with five learned embedding vectors: one camera token and four register tokens. After the Aggregator, VGGT regresses camera poses from the camera tokens using a light-weight MLP head, and dense outputs (point maps, depth, point tracks) using DPT heads (Ranftl et al., 2021).
  • Figure 3: Visualization of VGGT's global attention matrix. A very small number of entries is highly activated, while the vast majority of entries is near zero. This visualization shows the average attention map over all heads of layer 15 in the VGGT aggregator, at an input resolution of $224\times 182$. Upper highlight: The special tokens attend to each other and form a distinctive pattern. Lower highlight: Patch-level attention is localized on a small subset of highly activated entries. See the supplementary material for an enlarged visualization.
  • Figure 4: VGGT's global attention matrix is extremely sparse. Left: We visualize the tokens corresponding to the top-k activated entries of the attention map of layer 15. Right: Average & maximum attention scores in the global attention maps; the shorthand {S,P}2{P,S} denotes attention between special (S) and patch (P) tokens. Layers in the middle of the aggregator exhibit higher activations and increased sparsity. Note the different scalings of the mean and max activations.
  • Figure 5: Influence of dropping global attention layers. We skip the computation of different global attention layers in the aggregator, starting from the earliest layers (Front), the last layers (Back), alternating between the two (Front & Back), or the middle layers (Middle), and evaluate pose estimation on CO3Dv2 (Reizenstein et al., 2021). The x-axis denotes the total number of skipped layers. The experiment shows that the model is especially sensitive to pruning of the center layers, and robust against pruning the early and late layers.
  • ...and 10 more figures
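
The scaling behaviour described in Figures 1 and 2 can be checked with a short back-of-the-envelope calculation. The sketch below counts query-key pairs for frame-wise versus global attention. The five special tokens per frame and the $518^2$ resolution come from the captions; the patch size of 14 is an assumption not stated on this page, and the function names are hypothetical.

```python
def tokens_per_frame(height, width, patch=14, special=5):
    """Patch tokens plus special tokens for one frame (patch size is assumed)."""
    return (height // patch) * (width // patch) + special

def attention_pairs(num_frames, per_frame):
    """Query-key pairs per layer for frame-wise and for global attention."""
    frame_wise = num_frames * per_frame ** 2   # each frame attends only to itself
    global_ = (num_frames * per_frame) ** 2    # every token attends to every token
    return frame_wise, global_

per_frame = tokens_per_frame(518, 518)  # 37 * 37 patches + 5 special tokens = 1374
for n in (1, 10, 100):
    fw, gl = attention_pairs(n, per_frame)
    print(f"{n:>3} frames: global / frame-wise = {gl // fw}")
```

Under these assumptions, global attention processes exactly as many times more pairs than frame-wise attention as there are input frames, which is why it dominates the forward pass for large image sets.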