STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

Runze Wang, Yuxuan Song, Youcheng Cai, Ligang Liu

Abstract

Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
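As a rough illustration of the first component, the sketch below keeps a decayed cumulative attention score per cached token and evicts the lowest-scoring tokens once a budget is exceeded. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `update_scores` and `evict`, the `decay` factor, the `budget`, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, List


def update_scores(scores: Dict[int, float],
                  attn: Dict[int, float],
                  decay: float) -> Dict[int, float]:
    """Decay every cached token's score, then add the attention it just received.

    `scores` maps a token id to its running score; `attn` maps a token id to the
    attention mass it received in the current step (hypothetical bookkeeping).
    """
    updated = {tok: decay * s for tok, s in scores.items()}
    for tok, a in attn.items():
        updated[tok] = updated.get(tok, 0.0) + a
    return updated


def evict(scores: Dict[int, float], budget: int) -> List[int]:
    """Return the ids of the lowest-scoring tokens so that at most `budget` remain."""
    # Sort ascending by score; ties are broken by token id (an assumption).
    ranked = sorted(scores, key=lambda t: (scores[t], t))
    n_evict = max(0, len(scores) - budget)
    return ranked[:n_evict]


if __name__ == "__main__":
    scores = {0: 1.0, 1: 1.0, 2: 1.0}
    scores = update_scores(scores, attn={0: 0.5, 2: 0.1}, decay=0.5)
    print(scores, evict(scores, budget=2))
```

A token that stops receiving attention sees its score shrink geometrically with `decay`, so it eventually falls below tokens that are still attended to and becomes a candidate for eviction.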

Paper Structure (15 sections, 13 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Runtime–memory scaling in streaming 3D reconstruction. Bars show per-frame runtime (ms) and lines show KV cache memory (GB, log scale) as the stream grows. Compared with Causal-VGGT, STAC reduces KV cache growth and stabilizes per-frame runtime as the stream length increases.
  • Figure 2: Spatio-temporal attention sparsity. Representative attention patterns in the global causal attention map of Causal-VGGT, retaining the top 1024 keys per query for visualization. (a) Spatial attention aligned with camera motion; (b) Persistent focus on first-frame tokens as global references; (c) Temporal anchoring via tokens from semantically stable landmark frames; (d) Long-range attention to camera tokens encoding global context.
  • Figure 3: Overview of STAC. Our framework reconstructs 3D scenes online using spatio-temporal token caching and chunk-based causal inference. (a) The Causal-VGGT module processes ViT-tokenized frames in each chunk using causal attention over the working temporal cache $\mathcal{M}^{\text{temp}}$ and spatial cache $\mathcal{M}^{\text{spat}}$ retrieved from a 3D voxel grid. (b) During inference, Working Temporal Token Caching updates token scores after each KV cache access, retaining high-scoring anchor tokens $\mathcal{M}^{\text{anchor}}$ while preserving first-frame reference tokens $\mathcal{M}^{\text{refer}}$ and sliding-window tokens $\mathcal{M}^{\text{window}}$, and evicting the rest. (c) Long-term Spatial Token Caching routes evicted tokens with 3D coordinates from the Head Decoder to many-to-one aggregation in $\mathcal{E}$ or one-to-one merging into $\mathcal{G}$. When $\mathcal{G}$ is full, re-merging frees a slot for the incoming merged token, and the updated representations are stored in a 3D voxel grid for future retrieval.
  • Figure 4: Qualitative results on streaming inputs from the 7-Scenes and NRGBD datasets.
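
The many-to-one aggregation described in the Figure 3(c) caption can be pictured with the sketch below, which buckets evicted tokens by the voxel containing their 3D coordinate and keeps one averaged feature per voxel. This is only an illustration under assumptions: the class `VoxelTokenStore`, the `voxel_size` parameter, and the use of a running mean as the aggregation rule are hypothetical, and the one-to-one merging into $\mathcal{G}$ and the re-merging step are not modeled.

```python
import math
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def voxel_index(xyz: Sequence[float], voxel_size: float) -> Voxel:
    """Quantize a 3D point to the integer index of the voxel that contains it."""
    return tuple(math.floor(c / voxel_size) for c in xyz)


class VoxelTokenStore:
    """Keeps one aggregated feature per occupied voxel (hypothetical helper)."""

    def __init__(self, voxel_size: float) -> None:
        self.voxel_size = voxel_size
        self.sums: Dict[Voxel, List[float]] = {}
        self.counts: Dict[Voxel, int] = {}

    def add(self, xyz: Sequence[float], feature: Sequence[float]) -> Voxel:
        """Fold an evicted token's feature into the voxel containing `xyz`."""
        key = voxel_index(xyz, self.voxel_size)
        if key not in self.sums:
            self.sums[key] = [0.0] * len(feature)
            self.counts[key] = 0
        self.sums[key] = [s + f for s, f in zip(self.sums[key], feature)]
        self.counts[key] += 1
        return key

    def get(self, key: Voxel) -> List[float]:
        """Return the mean feature stored for a voxel (assumed aggregation rule)."""
        n = self.counts[key]
        return [s / n for s in self.sums[key]]


if __name__ == "__main__":
    store = VoxelTokenStore(voxel_size=0.5)
    store.add((0.1, 0.2, 0.3), [1.0, 3.0])
    store.add((0.4, 0.4, 0.4), [3.0, 5.0])   # falls in the same voxel as above
    store.add((0.6, 0.0, -0.1), [7.0, 7.0])  # falls in a different voxel
    print(len(store.counts), store.get((0, 0, 0)))
```

Under this scheme the number of stored entries is bounded by the number of occupied voxels rather than by the stream length, which is the property that makes voxel-aligned storage attractive for long streams.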