FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT

Zhisong Xu, Takeshi Oishi

TL;DR

This work proposes FrameVGGT, a frame-driven rolling explicit-memory framework that treats each frame's incremental KV contribution as a coherent evidence block and achieves favorable accuracy–memory trade-offs under bounded memory, while maintaining more stable geometry over long streams.

Abstract

Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception but suffer from unbounded KV-cache growth, which limits deployment over long streams. We revisit bounded-memory streaming from the perspective of geometric support. In geometry-driven reasoning, memory quality depends not only on how many tokens are retained, but also on whether the retained memory still preserves sufficiently coherent local support. This suggests that token-level retention may become less suitable under fixed budgets, as it can thin the evidence available within each contributing frame and make subsequent fusion more sensitive to weakly aligned history. Motivated by this observation, we propose FrameVGGT, a frame-driven rolling explicit-memory framework that treats each frame's incremental KV contribution as a coherent evidence block. FrameVGGT summarizes each block into a compact prototype and maintains a fixed-capacity mid-term bank of complementary frame blocks under strict budgets, with an optional lightweight anchor tier for rare prolonged degradation. Across long-sequence 3D reconstruction, video depth estimation, and camera pose benchmarks, FrameVGGT achieves favorable accuracy–memory trade-offs under bounded memory, while maintaining more stable geometry over long streams.
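
The frame-level bank described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `FrameBank`, the mean-pooled prototype, the cosine distance, and the rule "evict the older block whose prototype is closest to another retained block" are all assumptions chosen to show how whole-frame KV blocks can be kept or dropped as units under a fixed capacity.

```python
import numpy as np


def frame_prototype(keys: np.ndarray) -> np.ndarray:
    """Summarize one frame's key block (tokens x dim) as a unit-norm mean vector.

    Mean pooling is an assumption made for this sketch.
    """
    proto = keys.mean(axis=0)
    return proto / (np.linalg.norm(proto) + 1e-8)


class FrameBank:
    """Hypothetical fixed-capacity bank that stores whole per-frame KV blocks."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = []  # entries: (frame_id, prototype, keys, values)

    def add(self, frame_id: int, keys: np.ndarray, values: np.ndarray) -> None:
        self.blocks.append((frame_id, frame_prototype(keys), keys, values))
        if len(self.blocks) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self) -> None:
        # Pairwise cosine distances between the retained frame prototypes.
        protos = np.stack([block[1] for block in self.blocks])
        dist = 1.0 - protos @ protos.T
        np.fill_diagonal(dist, np.inf)
        nearest = dist.min(axis=1)
        # Drop the older block that is closest to some other block; the newest
        # block (last entry) is excluded so the current frame is always kept.
        victim = int(np.argmin(nearest[:-1]))
        del self.blocks[victim]

    def frame_ids(self) -> list:
        return [block[0] for block in self.blocks]

    def memory(self):
        """Concatenate the retained blocks into one key and one value array."""
        keys = np.concatenate([block[2] for block in self.blocks], axis=0)
        values = np.concatenate([block[3] for block in self.blocks], axis=0)
        return keys, values


# Toy stream: frames 1 and 2 have identical keys, frame 0 differs from both.
bank = FrameBank(capacity=2)
bank.add(0, np.tile([1.0, 0.0], (4, 1)), np.zeros((4, 2)))
bank.add(1, np.tile([0.0, 1.0], (4, 1)), np.ones((4, 2)))
bank.add(2, np.tile([0.0, 1.0], (4, 1)), np.full((4, 2), 2.0))
print(bank.frame_ids())  # [0, 2]
```

In this toy stream the bank drops frame 1, whose prototype duplicates the incoming frame 2, and keeps frame 0 with all four of its tokens. A token-level policy with the same budget would instead keep a few tokens from each of the three frames.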

Paper Structure (54 sections, 1 theorem, 31 equations, 13 figures, 8 tables)

Key Result

Proposition A.1

Under the global budget constraint (eq:budget), for fixed $M$, the average number of retained tokens per frame vanishes as $T \rightarrow \infty$.
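
A short sketch of why this holds, assuming the budget constraint caps the total number of retained tokens at $M$. The symbol $m_t$, the number of tokens retained from frame $t$, is introduced here for illustration and may differ from the paper's notation:

$$
\sum_{t=1}^{T} m_t \le M
\quad\Longrightarrow\quad
\frac{1}{T}\sum_{t=1}^{T} m_t \le \frac{M}{T} \longrightarrow 0
\quad \text{as } T \rightarrow \infty .
$$

For example, with $M = 10{,}000$ tokens and $T = 1{,}000$ frames, at most $10$ tokens per frame can be retained on average if every frame is to stay represented.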

Figures (13)

  • Figure 1: Memory organization strategies for streaming VGGT under a fixed budget. StreamVGGT keeps the full cache, InfiniteVGGT applies token-level bounded retention, and FrameVGGT applies support-aligned frame/block-level bounded retention.
  • Figure 2: Pipeline of FrameVGGT. Previous inputs are encoded to form per-layer KV blocks, which are managed by a mid-term pool using a distance-based greedy selection policy and by an optional anchor pool gated by gap and geometry thresholds. The selected cache is loaded to condition new inputs for streaming inference.
  • Figure 3: Reconstruction visualization comparison on the 7-Scenes dataset. InfiniteVGGT exhibits floating artifacts over extended sequences, while our method maintains a more stable result.
  • Figure 4: Pose visualization comparison on the TUM dataset. InfiniteVGGT exhibits noticeable long-horizon drift over extended sequences, while our method maintains a more stable trajectory estimate.
  • Figure 5: Visualization of memory-key heatmaps at different timesteps. Left: InfiniteVGGT. Right: FrameVGGT. Heatmaps visualize saved memory key snapshots at different checkpoints. Rows are retained memory tokens (internal memory order) and columns are key dimensions (one layer/head slice; batch=0 and head=0 when applicable). Color encodes the key value; the colorbar is auto-scaled per snapshot (ticks chosen automatically), so absolute intensities are not directly comparable across checkpoints. Banded patterns indicate groups of tokens with similar key profiles (higher row-wise correlation), while diffuse patterns indicate more heterogeneous keys.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Proposition A.1: Average retained tokens under bounded memory
  • proof