Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
TL;DR
This work addresses the memory scalability of streaming 4D vision transformers with Evict3R, a training-free token eviction strategy that bounds the KV cache at inference time. It assigns each layer a token budget, with the allocation across layers informed by attention sparsity, and uses an attention-based importance score to retain informative tokens and discard redundant ones. The approach achieves accuracy comparable to StreamVGGT on depth, reconstruction, and pose tasks while significantly reducing peak memory and supporting substantially longer sequences, which makes real-time streaming robotics more feasible. The method requires no retraining, preserves essential references, and integrates with existing transformer architectures, offering a practical path toward scalable streaming 3D perception.
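To make the idea concrete, here is a minimal sketch of budgeted KV-cache eviction in NumPy. It is an illustration under stated assumptions, not the Evict3R implementation: the function names, the mean-attention importance score, the rule that splits the budget in proportion to attention density (1 minus sparsity), and the `protected` prefix of always-kept tokens are all hypothetical choices made here for clarity.

```python
import numpy as np


def attention_importance(attn):
    """Score each cached token by the attention it receives.

    attn: softmax attention weights of shape (heads, queries, keys).
    Returns an array of shape (keys,).
    Assumption: plain mean over heads and queries; the paper's score may differ.
    """
    return attn.mean(axis=(0, 1))


def allocate_layer_budgets(sparsity, total_budget, min_per_layer=1):
    """Split a total token budget across layers.

    sparsity: per-layer attention sparsity in [0, 1).
    Assumption: layers with denser attention get a proportionally larger share.
    """
    density = 1.0 - np.asarray(sparsity, dtype=float)
    weights = density / density.sum()
    budgets = np.floor(weights * total_budget).astype(int)
    return np.maximum(budgets, min_per_layer)


def evict_tokens(keys, values, scores, budget, protected=0):
    """Keep at most `budget` cached tokens for one layer.

    keys, values: arrays of shape (num_tokens, dim).
    scores: importance per token, shape (num_tokens,).
    protected: number of leading tokens that are never evicted
               (hypothetical stand-in for reference tokens; must be <= budget).
    """
    num_tokens = keys.shape[0]
    if num_tokens <= budget:
        return keys, values, scores  # cache is within budget, nothing to evict

    candidates = np.arange(protected, num_tokens)
    num_keep = max(budget - protected, 0)
    # Indices of the highest-scoring evictable tokens.
    order = np.argsort(scores[candidates])[::-1]
    top = candidates[order[:num_keep]]
    # Restore the original temporal order of the retained tokens.
    keep = np.sort(np.concatenate([np.arange(protected), top]))
    return keys[keep], values[keep], scores[keep]
```

In a streaming loop, one would call `evict_tokens` for each layer after appending the new frame's keys and values, using that layer's entry from `allocate_layer_budgets` as `budget`, so the cache size stays bounded regardless of sequence length.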
Abstract
Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key-value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences, it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
