OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Si-Yu Lu; Po-Ting Chen; Hui-Che Hsu; Sin-Ye Jhong; Wen-Huang Cheng; Yung-Yao Chen

TL;DR

OVGGT is a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. It combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction.

Abstract

Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.

Paper Structure (22 sections, 8 equations, 8 figures, 10 tables)

Introduction
Related Work
Classical Geometric Reconstruction
Geometric Foundation Models
Method
Preliminaries and Bottlenecks
Self-Selective Caching
Dynamic Anchor Protection
Experiments
3D Reconstruction
Video Depth Estimation
Inference Efficiency
Ablation Studies
Conclusion
Comparison with Full-Cache Baseline
...and 7 more sections

Figures (8)

Figure 1: Streaming 3D on a single 32 GB GPU. Left: On 200-frame sequences [shotton2013scenes], OVGGT outperforms all baselines in reconstruction quality, speed, and VRAM usage. Right: From 50 to 500 frames, StreamVGGT runs out of memory; other methods survive but suffer notable quality degradation. OVGGT maintains high-fidelity reconstructions at lower cost.
Figure 2: Overview of OVGGT. At each time step, the input frame is encoded into tokens and processed by a spatial-temporal decoder that attends to a bounded KV cache. During inference, the Activation Value Rating module scores each token's geometric salience, and the KV Cache Compression (KVCC) module evicts low-scoring tokens to maintain a fixed cache budget. Dynamic Anchor Protection (DAP) shields coordinate-critical tokens from eviction, ensuring long-range geometric stability.
Figure 3: Per-token FFN activation scores across layers, progressing from high-frequency textures (shallow) to geometric structures (mid) to semantic boundaries (deep).
Figure 4: Activation smoothing effectively improves reconstruction quality over vanilla token retention.
Figure 5: Qualitative comparison on indoor scene reconstruction (sequence length $= 500$). Each row shows a different scene with close-up insets. Note that StreamVGGT is limited to a maximum of 200 input frames due to memory constraints.
...and 3 more figures
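
The eviction step described in the Figure 2 caption (score each cached token, drop the lowest-scoring ones to stay within a fixed budget, never drop protected anchors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `token_scores` and `compress_cache`, the use of a plain L2 norm as the score, and the way anchors are passed in as a set of indices are all assumptions made for illustration. The paper's actual scoring, smoothing, and anchor selection are not specified in this summary.

```python
from math import sqrt


def token_scores(ffn_residuals):
    """Score each token by the L2 norm of its FFN residual vector.

    Assumption: the norm stands in for the 'FFN residual magnitude'
    mentioned in the abstract; the real rating may differ.
    """
    return [sqrt(sum(x * x for x in vec)) for vec in ffn_residuals]


def compress_cache(cache, scores, budget, anchor_ids=frozenset()):
    """Keep at most `budget` cache entries.

    cache      -- list of (key, value) pairs, one per cached token
    scores     -- one salience score per cached token
    budget     -- fixed maximum number of entries to retain
    anchor_ids -- hypothetical set of indices that must never be evicted

    Returns (kept_indices, kept_entries), both in original token order.
    """
    if len(cache) != len(scores):
        raise ValueError("cache and scores must have the same length")

    anchors = [i for i in range(len(cache)) if i in anchor_ids]
    if len(anchors) > budget:
        raise ValueError("budget is smaller than the number of anchors")

    # Rank the non-anchor tokens from most to least salient.
    candidates = sorted(
        (i for i in range(len(cache)) if i not in anchor_ids),
        key=lambda i: scores[i],
        reverse=True,
    )

    # Anchors are kept unconditionally; the rest fill the remaining budget.
    kept = sorted(anchors + candidates[: budget - len(anchors)])
    return kept, [cache[i] for i in kept]


# Toy example: five cached tokens, budget of three, token 0 is an anchor.
residuals = [[0.1, 0.0], [3.0, 4.0], [0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
cache = [(f"k{i}", f"v{i}") for i in range(len(residuals))]
scores = token_scores(residuals)
kept, compressed = compress_cache(cache, scores, budget=3, anchor_ids={0})
print(kept)  # [0, 1, 4]
```

In this sketch the score depends only on per-token activations, so nothing here requires access to an attention matrix. Calling `compress_cache` after every frame keeps the cache length at or below `budget` no matter how many frames have been processed.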
