Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers

Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi

TL;DR

This work tackles the memory scalability of streaming 4D vision transformers by introducing Evict3R, a training-free token eviction strategy that bounds the KV cache at inference time. It uses per-layer budgets and an attention-based importance score, with layer-wise budget allocation informed by attention sparsity, to selectively retain informative tokens while discarding redundancy. The approach achieves comparable accuracy to StreamVGGT on depth, reconstruction, and pose tasks while significantly reducing peak memory and enabling substantially longer sequences, thus making real-time streaming robotics more feasible. The method requires no retraining, preserves essential references, and integrates smoothly with existing transformer architectures, offering a practical path toward scalable streaming 3D perception.

Abstract

Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key-value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences, it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
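The eviction idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `evict_tokens`, the `protected` parameter, and the choice of mean attention received as the importance score are all assumptions made for illustration. The sketch keeps at most `budget` tokens in one layer's KV cache, always retaining the first `protected` tokens as a stand-in for the "essential references" mentioned in the TL;DR.

```python
import numpy as np

def evict_tokens(keys, values, attn, budget, protected=0):
    """Illustrative top-k eviction for a single layer's KV cache.

    keys, values: arrays of shape (n_tokens, d)
    attn:         attention weights of shape (n_queries, n_tokens)
    budget:       maximum number of tokens to keep (hypothetical parameter)
    protected:    number of leading tokens that are never evicted (assumption)
    Returns the pruned keys, pruned values, and the kept indices in cache order.
    """
    n = keys.shape[0]
    if n <= budget:
        return keys, values, np.arange(n)
    # Assumed importance score: mean attention each cached token receives.
    scores = attn.mean(axis=0).astype(float)
    # Protected tokens get an infinite score so they always rank first.
    scores[:protected] = np.inf
    keep = np.argsort(-scores, kind="stable")[:budget]
    keep.sort()  # restore original cache order
    return keys[keep], values[keep], keep

# Toy usage: 6 cached tokens, 2 queries, budget of 3, first token protected.
keys = np.arange(12, dtype=float).reshape(6, 2)
values = keys * 10.0
attn = np.array([[0.10, 0.05, 0.40, 0.05, 0.30, 0.10],
                 [0.20, 0.05, 0.30, 0.05, 0.30, 0.10]])
k, v, kept = evict_tokens(keys, values, attn, budget=3, protected=1)
print(kept)  # [0 2 4]
```

Because the cache never exceeds `budget` tokens per layer, memory stays bounded regardless of how many frames have been streamed; the scoring rule actually used by the method may differ from the mean-attention rule assumed here.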

Paper Structure

This paper contains 23 sections, 7 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Qualitative 3D pointmap comparison. Reconstructions from StreamVGGT and our method with $B{=}0.2$. Despite using less than half the memory, our method shows no visible drop in quality.
  • Figure 2: Overview of our token eviction framework. (A) Tokens in each layer’s KV cache are ranked by attention importance and evicted under a per-layer budget. (B) Importance scores are derived from query–key attention statistics. (C) Scores are normalized to guide consistent token selection across layers.
  • Figure 3: Head-averaged attention maps across temporal layers in StreamVGGT [Zhuo2025Streaming4V]. Each subplot corresponds to one layer, averaged across all attention heads. Color scales are independent per subplot (not comparable across layers). For better visibility, rows are multiplied by their frame index $r$, since later frames have more tokens and lower raw attention values. White dashed lines mark frame boundaries along both axes.
  • Figure 4: Token eviction masks for the first layer across frames 2–5 of an 11-frame sequence. Each subplot shows which tokens are retained or evicted at that frame. Blue-bordered squares indicate evicted tokens.
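
The TL;DR states that per-layer budgets are allocated based on attention sparsity, but this summary does not give the allocation rule. The sketch below shows one possible way such an allocation could work, purely as an assumption: layers whose attention is more spread out (higher normalized entropy) receive a larger share of a global token budget, while layers with sparse, concentrated attention receive less. The names `attention_entropy`, `allocate_budgets`, `total_budget`, and `min_per_layer` are hypothetical.

```python
import numpy as np

def attention_entropy(attn):
    """Mean row entropy of an attention matrix, normalized to [0, 1].

    attn: shape (n_queries, n_tokens), each row summing to 1.
    Values near 0 indicate sparse attention; values near 1 indicate uniform attention.
    """
    p = np.clip(attn, 1e-12, 1.0)
    row_entropy = -(p * np.log(p)).sum(axis=1)
    return float(row_entropy.mean() / np.log(attn.shape[1]))

def allocate_budgets(layer_attn, total_budget, min_per_layer=1):
    """Split a global token budget across layers (assumed rule, not the paper's).

    Each layer's share is proportional to its normalized attention entropy,
    with a floor of min_per_layer tokens. Because of the floor, the budgets
    may sum to slightly more than total_budget.
    """
    density = np.array([attention_entropy(a) for a in layer_attn])
    weights = density / (density.sum() + 1e-12)
    budgets = np.floor(weights * total_budget).astype(int)
    return np.maximum(budgets, min_per_layer)

# Toy usage: layer 0 attends uniformly, layer 1 attends to a single token.
dense_layer = np.full((3, 4), 0.25)
sparse_layer = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (3, 1))
budgets = allocate_budgets([dense_layer, sparse_layer], total_budget=100)
print(budgets)
```

In this toy case the dense layer receives almost the entire budget and the sparse layer falls back to the floor, which matches the intuition that a layer attending to few tokens can tolerate more aggressive eviction.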