HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT

Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee, Sungroh Yoon

Abstract

Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget: denser attention is assigned to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.

Figures (16)

  • Figure 1: Comparison with SparseVGGT [wang2025fastervggt]. SparseVGGT suffers from severe performance degradation because it applies uniform sparsity across all heads, overly sparsifying sensitive ones. Our HeSS-Guided Sparsification preserves these highly sensitive heads, retaining performance even under high sparsity.
  • Figure 2: Pipeline Overview. Our pipeline consists of two stages: (a) Calibration stage, which computes the Head Sensitivity Score (HeSS) of all attention heads in VGGT's Global Attention layers. The Hessian with respect to two errors, the camera pose error $e_\text{cam}$ and the point cloud error $e_\text{pc}$, is used to compute HeSS. HeSS is obtained from a calibration set, and these scores are fixed during the inference stage. (b) Inference stage, in which a different masking ratio is assigned to each attention head based on its HeSS.
  • Figure 3: HeSS-Guided Budget Reallocation. The total budget ($C_{\text{total}}$) is obtained by summing baseline per-head budgets ($c_{h_n}$) (top). This budget is then reallocated based on HeSS scores (red). An iterative Budget Capping process redistributes surplus (orange) from heads exceeding their capacity to the remaining uncapped heads, yielding the final budget ($c_{h_n}^{\text{final}}$, right).
  • Figure 4: HeSS Distribution in VGGT. The horizontal axis represents the Global Attention (GA) layers from GA 1 to GA 21. Each layer contains two columns, corresponding to $\mathrm{HeSS}_\text{cam}$ (Cam) and $\mathrm{HeSS}_\text{pc}$ (PC). The vertical axis lists the attention heads (H1–H16). Darker colors indicate higher sensitivity.
  • Figure 5: Error visualization on DTU [jensen2014dtu]. For each scene, we visualize predicted point clouds from VGGT [wang2025vggt], SparseVGGT [wang2025fastervggt], and our method at various sparsity levels. Points whose 3D error exceeds 5 mm from the DTU ground truth are highlighted in green. Our method produces fewer highlighted points than SparseVGGT, reflecting a more stable reconstruction as sparsity increases.
  • ...and 11 more figures
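The budget reallocation with iterative Budget Capping described in Figure 3 can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact procedure: the function name `reallocate_budget`, the proportional-to-HeSS allocation rule, and the per-head capacity representation are assumptions for the sketch.

```python
def reallocate_budget(hess, capacities, total_budget):
    """Distribute total_budget across heads in proportion to their HeSS,
    capping each head at its capacity and iteratively redistributing the
    surplus from capped heads to the remaining uncapped heads."""
    n = len(hess)
    budget = [0.0] * n
    capped = [False] * n
    remaining = total_budget
    while remaining > 1e-9:
        # Total score over heads that can still absorb budget.
        score_sum = sum(s for s, c in zip(hess, capped) if not c)
        if score_sum == 0:
            break  # every head is capped; leftover budget is dropped
        surplus = 0.0
        for i in range(n):
            if capped[i]:
                continue
            share = remaining * hess[i] / score_sum
            headroom = capacities[i] - budget[i]
            if share >= headroom:
                # Head exceeds its capacity: cap it and collect the excess.
                budget[i] = capacities[i]
                capped[i] = True
                surplus += share - headroom
            else:
                budget[i] += share
        remaining = surplus  # redistribute in the next pass
    return budget
```

For example, with HeSS scores `[3, 1]`, capacities `[2, 10]`, and a total budget of 4, the first pass would assign 3 units to the sensitive head; since its capacity is 2, the surplus unit is redistributed to the other head in the second pass, yielding final budgets `[2, 2]`.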