Quantized Visual Geometry Grounded Transformer

Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

TL;DR

This work addresses the challenge of compressing billion-parameter Visual Geometry Grounded Transformers (VGGTs) via Post-Training Quantization (PTQ). It identifies two VGGT-specific obstacles: data-independent special tokens cause heavy-tailed activations and calibration is unstable due to multi-view data. It proposes QuantVGGT, combining Dual-Smoothed Fine-Grained Quantization (DSFQ) and Noise-Filtered Diverse Sampling (NFDS) to achieve robust PTQ at low bit-width. Experiments on Co3Dv2 and DTU show state-of-the-art performance at both W8A8 and W4A4, with a 3.7x memory reduction and 2.5x speedup, while preserving reconstruction accuracy above 98% of the full-precision model. The authors provide open-source code to facilitate practical deployment of VGGT quantization.

Abstract

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models, but we empirically observe that it faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. It relies on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7$\times$ memory reduction and a 2.5$\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
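The abstract's "pre-global Hadamard rotation and post-local channel smoothing" can be illustrated with a minimal sketch. This is one plausible reading of that sentence, not the authors' implementation: the tensor shapes, the outlier channel, and the square-root balancing rule for the smoothing scales (borrowed from SmoothQuant-style methods) are all assumptions made for illustration. The sketch only checks that both transforms leave the full-precision layer output unchanged, which is the property that allows them to be applied before quantization.

```python
import math
import random

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def col_max(m, j):
    return max(abs(row[j]) for row in m)

rng = random.Random(0)
tokens, d_in, d_out = 4, 8, 3  # hypothetical toy sizes

# Activations with one hypothetical large-magnitude channel, and random weights.
X = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(tokens)]
for row in X:
    row[2] *= 50.0
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

Y_ref = matmul(X, transpose(W))  # full-precision linear layer: Y = X W^T

# Step 1 (global rotation): (X H)(W H)^T = X H H^T W^T = X W^T for orthonormal H.
H = hadamard(d_in)
X_rot = matmul(X, H)
W_rot = matmul(W, H)

# Step 2 (channel smoothing in the rotated space, assumed rule):
# divide activations and multiply weights by the same per-channel scale.
s = [math.sqrt(col_max(X_rot, j) / col_max(W_rot, j)) for j in range(d_in)]
X_s = [[v / s[j] for j, v in enumerate(row)] for row in X_rot]
W_s = [[v * s[j] for j, v in enumerate(row)] for row in W_rot]

Y_new = matmul(X_s, transpose(W_s))
max_err = max(abs(a - b) for ra, rb in zip(Y_ref, Y_new) for a, b in zip(ra, rb))

print("largest activation before:", max(abs(v) for row in X for v in row))
print("largest activation after: ", max(abs(v) for row in X_s for v in row))
print("max output difference:    ", max_err)
```

In an actual PTQ pipeline the quantizer would be applied to `X_s` and `W_s` rather than to `X` and `W`; the sketch stops before that step because the paper's quantizer granularity is not specified in this summary.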

Paper Structure

This paper contains 23 sections, 2 theorems, 19 equations, 10 figures, 8 tables.

Key Result

Lemma 3.1

Due to the central limit effect, the distribution of values after Hadamard rotation tends to approximate a Gaussian, thereby smoothing the heavy-tailed distribution introduced by special tokens [tseng2024quip].
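The effect described in Lemma 3.1 can be checked numerically with a small sketch. The vector length, the outlier positions, and the outlier magnitude below are hypothetical stand-ins for activations dominated by a few special-token outliers; they are not taken from the paper. The sketch applies an orthonormal fast Walsh-Hadamard transform and compares the kurtosis (about 3 for a Gaussian, much larger for heavy tails) before and after rotation.

```python
import math
import random

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in out]

def kurtosis(values):
    """Fourth standardized moment m4 / m2^2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2)

rng = random.Random(0)
n = 1024
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
for idx in (3, 200, 511, 900):  # hypothetical outlier positions
    x[idx] = 60.0               # hypothetical outlier magnitude

rotated = fwht(x)

print("kurtosis before rotation:", kurtosis(x))
print("kurtosis after rotation: ", kurtosis(rotated))
print("max |value| before:", max(abs(v) for v in x))
print("max |value| after: ", max(abs(v) for v in rotated))
```

Because the transform is orthonormal, the total energy of the vector is unchanged; the rotation only spreads each outlier's energy across all coordinates, which is why the tails shrink and a uniform quantization grid wastes fewer levels on rare extremes.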

Figures (10)

  • Figure 1: QuantVGGT effectively quantizes VGGT [wang2025vggt] to W4A4 without compromising visual quality while bringing 2.5$\times$ speedup and 3.7$\times$ compression.
  • Figure 2: Overview of the proposed QuantVGGT. Top: Our proposed Dual-Smoothed Fine-Grained Quantization architecture. Bottom: Our proposed Noise-Filtered Diverse Sampling strategy.
  • Figure 3: The motivation and effect of Dual-Smoothed Fine-Grained Quantization. (a): Salient distribution of VGGT [wang2025vggt] frame_block 9. (b): Saliency of registered tokens. (c): Distribution after naive rotation. (d): Distribution after our dual smoothing. We provide more analysis in Appendix Sec. \ref{sec:more_distribution}.
  • Figure 4: The motivation and effect of Noise-Filtered Diverse Sampling. (a): Layer distribution of VGGT [wang2025vggt]. (b): Visualization of label-based clusters. (c): Visualization of feature-based clusters. (d): Visualization of our clusters. We provide more analysis in Appendix Sec. \ref{sec:more_sampling}.
  • Figure 5: Ablation study on sample strategy.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Lemma 3.1
  • Theorem 3.2: Calibration sampling principle
  • proof