XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression

Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong

TL;DR

XStreamVGGT is a tuning-free approach that systematically compresses the KV cache of StreamVGGT through joint pruning and quantization, enabling extremely memory-efficient streaming inference.

Abstract

Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling practical and scalable streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
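
The pruning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, following the Figure 2 caption, that the current frame's Queries are average-pooled into a single vector, that each cached Key is scored by its dot product with that vector, and that first-frame tokens are always kept. The function names, the plain-list data layout, and the convention that the budget counts first-frame plus retained historical tokens are all hypothetical. Scoring from a pooled Query and the Keys alone presumably avoids materializing the full attention matrix, which fused kernels such as FlashAttention do not expose.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(queries: Sequence[Vector]) -> Vector:
    """Average the current frame's query vectors into one compact query."""
    n = len(queries)
    dim = len(queries[0])
    return [sum(q[d] for q in queries) / n for d in range(dim)]


def importance_scores(pooled_query: Vector, keys: Sequence[Vector]) -> List[float]:
    """Score each cached key by its dot product with the pooled query."""
    return [sum(qd * kd for qd, kd in zip(pooled_query, k)) for k in keys]


def select_kept_indices(
    scores: Sequence[float], num_first_frame: int, budget: int
) -> List[int]:
    """Keep every first-frame token, then fill the rest of the budget
    with the highest-scoring historical tokens (assumed budget convention)."""
    first = list(range(min(num_first_frame, len(scores))))
    remaining = max(budget - len(first), 0)
    history = range(len(first), len(scores))
    top = sorted(history, key=lambda i: scores[i], reverse=True)[:remaining]
    # Restore temporal order so the pruned cache stays chronologically sorted.
    return first + sorted(top)


# Toy example: 2 query tokens, 5 cached keys, token 0 belongs to the first frame.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [0.0, -2.0], [3.0, 3.0]]

pooled = mean_pool(queries)
scores = importance_scores(pooled, keys)
kept = select_kept_indices(scores, num_first_frame=1, budget=3)
print(kept)
```

In this toy case the cache is cut from five tokens to three: the first-frame token survives regardless of its score, and the two historical tokens with the largest scores fill the remaining slots. The KVs of the newly arrived frame would then be appended to the retained entries.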

Paper Structure (18 sections, 12 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Efficiency analysis on a single 80GB A100 GPU. As the number of input frames increases, StreamVGGT and VGGT exhibit significant degradation in frames per second (FPS) and rapidly encounter out-of-memory (OOM) errors. In contrast, XStreamVGGT consistently delivers significantly higher FPS without encountering OOM issues.
  • Figure 2: Overview of XStreamVGGT. Upon receiving a new input frame (Step 1), Queries from the global attention layer are aggregated via average pooling to form a compact representation, which is then matched against the Key to estimate token importance (Step 2). Guided by these Key-derived importance scores, low-importance historical KV pairs are selectively pruned, while KVs from the first frame are explicitly retained to preserve geometric consistency. The remaining high-importance KVs are concatenated with the first-frame KVs and the newly generated KVs from the current frame (Step 3). Following pruning, the KV cache is further compressed using dimension-adaptive quantization, employing per-channel Key quantization and per-token Value quantization to reduce the impact of outlier channels on quantization accuracy (Step 4). This results in a compact KV cache for efficient subsequent updates (Step 5).
  • Figure 3: Attention sparsity analysis. Attention heatmaps from Layer 14 of StreamVGGT reveal that attention weights are predominantly concentrated on Query-relevant regions. In contrast, other areas exhibit significantly lower attention, indicating substantial redundancy in the feature representation.
  • Figure 4: Magnitude distributions of the Key and Value. The Key demonstrates significant channel-wise outliers, with a small subset of channels exhibiting magnitudes substantially larger than the others. In contrast, the distribution of the Value is more uniform, with no prominent outlier behavior.
  • Figure 5: Ablation study of cache length and analysis of memory with increasing frame length.
  • ...and 2 more figures
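
The dimension-adaptive quantization described in the Figure 2 and Figure 4 captions can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the captions specify only the grouping axes (per-channel for the Key, per-token for the Value), so the uniform min/max quantizer and the 4-bit width used here are hypothetical choices, as are all function names. The sketch quantizes and immediately dequantizes a small Key matrix with one outlier channel, then compares the reconstruction error of the two grouping schemes.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are tokens, columns are channels


def quantize_group(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniform asymmetric quantization of one group to `bits`-bit codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def roundtrip_per_token(m: Matrix, bits: int) -> Matrix:
    """Quantize then dequantize each row (token) with its own scale."""
    return [dequantize_group(*quantize_group(row, bits)) for row in m]


def roundtrip_per_channel(m: Matrix, bits: int) -> Matrix:
    """Quantize then dequantize each column (channel) with its own scale."""
    return transpose(roundtrip_per_token(transpose(m), bits))


def max_abs_error(a: Matrix, b: Matrix) -> float:
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy Key matrix: channel 2 is an outlier channel with much larger magnitudes.
keys = [
    [0.1, -0.4, 50.0, 0.3],
    [0.3, 0.2, 52.0, -0.1],
    [-0.2, 0.5, 49.0, 0.0],
    [0.2, -0.3, 51.0, 0.4],
]

BITS = 4  # assumed bit width; the source text does not state one
err_per_token = max_abs_error(keys, roundtrip_per_token(keys, BITS))
err_per_channel = max_abs_error(keys, roundtrip_per_channel(keys, BITS))
print(f"per-token error:   {err_per_token:.4f}")
print(f"per-channel error: {err_per_channel:.4f}")
```

With per-token grouping, the outlier channel stretches every row's quantization range, so the small-magnitude channels lose most of their resolution. Per-channel grouping confines the outlier to its own scale, which is consistent with the caption's rationale for quantizing the Key per channel. The Value, whose distribution is described as more uniform, would use the per-token path instead. A real cache would store the integer codes together with each group's scale and offset rather than the dequantized floats.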