OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen

TL;DR

OVGGT is a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. It combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction.

Abstract

Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
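
To make the contrast between an ever-growing KV cache and a fixed budget concrete, the following minimal sketch computes cache size as a function of the number of frames processed. All model dimensions (tokens per frame, layer count, head count, head dimension, 2-byte storage) and the token budget are hypothetical values chosen for illustration; none of them are taken from the paper.

```python
def kv_cache_bytes(cached_tokens, num_layers, num_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store both keys and values for `cached_tokens` tokens."""
    # The leading factor of 2 accounts for storing a key and a value per token.
    return 2 * cached_tokens * num_layers * num_heads * head_dim * bytes_per_value


def unbounded_tokens(num_frames, tokens_per_frame):
    """Causal streaming without eviction: every token of every frame stays cached."""
    return num_frames * tokens_per_frame


def bounded_tokens(num_frames, tokens_per_frame, budget_tokens):
    """Streaming with eviction: the cache never holds more than `budget_tokens`."""
    return min(num_frames * tokens_per_frame, budget_tokens)


if __name__ == "__main__":
    # Hypothetical model dimensions and budget, for illustration only.
    dims = dict(num_layers=24, num_heads=16, head_dim=64)
    tokens_per_frame, budget = 1000, 20000
    for frames in (50, 200, 500):
        grow = kv_cache_bytes(unbounded_tokens(frames, tokens_per_frame), **dims)
        fixed = kv_cache_bytes(bounded_tokens(frames, tokens_per_frame, budget), **dims)
        print(f"{frames:>4} frames: unbounded {grow / 2**30:.2f} GiB, "
              f"bounded {fixed / 2**30:.2f} GiB")
```

Under these assumptions the unbounded cache grows linearly with the number of frames, while the bounded cache stops growing once the budget is reached, which is the constant-memory behavior the abstract describes.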

Paper Structure (22 sections, 8 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Streaming 3D on a single 32 GB GPU. Left: On 200-frame sequences [shotton2013scenes], OVGGT outperforms all baselines in reconstruction quality, speed, and VRAM usage. Right: From 50 to 500 frames, StreamVGGT runs out of memory; other methods survive but suffer notable quality degradation. OVGGT maintains high-fidelity reconstructions at lower cost.
  • Figure 2: Overview of OVGGT. At each time step, the input frame is encoded into tokens and processed by a spatial-temporal decoder that attends to a bounded KV cache. During inference, the Activation Value Rating module scores each token's geometric salience, and the KV Cache Compression (KVCC) module evicts low-scoring tokens to maintain a fixed cache budget. Dynamic Anchor Protection (DAP) shields coordinate-critical tokens from eviction, ensuring long-range geometric stability.
  • Figure 3: Per-token FFN activation scores across layers, progressing from high-frequency textures (shallow) to geometric structures (mid) to semantic boundaries (deep).
  • Figure 4: Activation smoothing effectively improves reconstruction quality over vanilla token retention.
  • Figure 5: Qualitative comparison on indoor scene reconstruction (sequence length $= 500$). Each row shows a different scene with close-up insets. Note that StreamVGGT is limited to a maximum of 200 input frames due to memory constraints.
  • ...and 3 more figures
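
The pipeline described in the Figure 2 caption (score each token, evict low-scoring tokens to hold a fixed budget, protect anchor tokens from eviction) can be illustrated with a minimal sketch. This is not the authors' implementation: the `CachedToken` class, the `compress` and `stream_step` functions, and the rule "anchors first, then highest score" are assumptions made for illustration. In the paper the scores come from FFN residual magnitudes; here they are plain floats supplied by the caller, and the key/value payloads are opaque placeholders.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class CachedToken:
    """Hypothetical cache entry: payload plus the metadata used for eviction."""
    frame: int
    index: int
    score: float             # stand-in for a per-token salience score
    is_anchor: bool = False  # stand-in for a protected, coordinate-critical token
    key: Any = None          # opaque key payload (a tensor in a real model)
    value: Any = None        # opaque value payload


def compress(tokens: List[CachedToken], budget: int) -> List[CachedToken]:
    """Keep at most `budget` tokens, preferring anchors, then higher scores."""
    if len(tokens) <= budget:
        return list(tokens)
    # Sort positions so that anchors come first and, within each group,
    # higher scores come first. If anchors alone exceed the budget,
    # the lowest-scoring anchors are dropped (an assumption of this sketch).
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: (not tokens[i].is_anchor, -tokens[i].score),
    )
    # Restore the original (temporal) order of the surviving tokens.
    keep = sorted(ranked[:budget])
    return [tokens[i] for i in keep]


def stream_step(cache: List[CachedToken], new_tokens: List[CachedToken],
                budget: int) -> List[CachedToken]:
    """Append the tokens of one frame, then compress back down to the budget."""
    return compress(cache + new_tokens, budget)


if __name__ == "__main__":
    cache: List[CachedToken] = []
    for frame in range(10):
        # Toy scores; the first two tokens of frame 0 are marked as anchors.
        new = [
            CachedToken(frame, i, float((frame * 7 + i * 3) % 5),
                        is_anchor=(frame == 0 and i < 2))
            for i in range(4)
        ]
        cache = stream_step(cache, new, budget=6)
    print(len(cache), sum(t.is_anchor for t in cache))
```

Because eviction here selects whole tokens and leaves the attention computation itself untouched, a cache compressed this way can still be passed to a standard fused attention kernel, which is in the spirit of the FlashAttention compatibility mentioned in the abstract.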