Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, Sebastian Scherer
TL;DR
Co-Me accelerates visual geometric transformers without retraining the base model. It distills a lightweight per-token confidence predictor and uses its output to merge low-confidence tokens. The approach preserves high-confidence, geometrically critical regions while reducing the quadratic attention and MLP workload, achieving speedups of up to 11.3x on VGGT and 7.2x on MapAnything with minimal degradation on depth, pose, and point cloud tasks. An efficient CUDA implementation with attention bias correction keeps overhead low and enables near-real-time performance on edge hardware. The method generalizes across VGGT, StreamVGGT, and MapAnything and is complementary to other acceleration strategies, with potential extensions to streaming-time and training-time integration.
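To illustrate why reducing the token count cuts both the quadratic attention cost and the linear MLP cost, here is a rough per-block operation count in Python. It is only a back-of-the-envelope sketch, not a measurement or the authors' analysis: the function names, the `mlp_ratio` of 4, and the uniform `reduction` factor are all assumptions made for illustration.

```python
def block_macs(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one standard transformer block."""
    # Q, K, V and output projections: linear in the number of tokens.
    attn_proj = 4 * n_tokens * dim * dim
    # QK^T and the attention-weighted sum over V: quadratic in the number of tokens.
    attn_mix = 2 * n_tokens * n_tokens * dim
    # Two MLP layers with hidden width mlp_ratio * dim: linear in the number of tokens.
    mlp = 2 * n_tokens * dim * (mlp_ratio * dim)
    return attn_proj + attn_mix + mlp


def estimated_speedup(n_tokens: int, dim: int, reduction: int) -> float:
    """Theoretical speedup when merging shrinks the sequence by `reduction` (assumed uniform)."""
    return block_macs(n_tokens, dim) / block_macs(n_tokens // reduction, dim)


if __name__ == "__main__":
    # Hypothetical sizes; longer sequences are dominated by the quadratic term.
    for n in (1_000, 10_000, 100_000):
        print(n, round(estimated_speedup(n, dim=1024, reduction=4), 2))
```

Under this model the estimated speedup lies between `reduction` (when the linear terms dominate) and `reduction ** 2` (when the quadratic term dominates), so it grows with sequence length. Measured speedups also depend on kernel efficiency and merging overhead, which the model ignores.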
Abstract
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merges low-confidence ones, reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, respectively, making visual geometric transformers practical for real-time 3D perception and reconstruction.
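The rank-then-merge idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the confidence scores are taken as given (the distilled predictor is not modeled), low-confidence tokens are grouped by index order and averaged, and the names `confidence_guided_merge`, `unmerge`, `keep_ratio`, and `group_size` are hypothetical.

```python
import numpy as np


def confidence_guided_merge(tokens, confidence, keep_ratio=0.5, group_size=4):
    """Keep high-confidence tokens; average low-confidence ones in small groups.

    tokens: (N, D) array, confidence: (N,) array.
    Returns the shortened sequence and an (N,) index that maps every original
    token to its row in the shortened sequence.
    """
    n = tokens.shape[0]
    n_keep = int(round(n * keep_ratio))
    order = np.argsort(-confidence, kind="stable")  # most confident first
    keep_idx = np.sort(order[:n_keep])
    low_idx = np.sort(order[n_keep:])

    # Assumption: group low-confidence tokens by index order (a stand-in for
    # spatial neighborhoods) and replace each group by its mean.
    groups = [low_idx[i:i + group_size] for i in range(0, len(low_idx), group_size)]
    parts = [tokens[keep_idx]]
    if groups:
        parts.append(np.stack([tokens[g].mean(axis=0) for g in groups]))
    merged = np.concatenate(parts, axis=0)

    assign = np.empty(n, dtype=np.int64)
    assign[keep_idx] = np.arange(n_keep)
    for j, g in enumerate(groups):
        assign[g] = n_keep + j
    return merged, assign


def unmerge(merged, assign):
    """Restore the original sequence length by copying each merged token back."""
    return merged[assign]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 8))
    conf = rng.uniform(size=16)
    short, assign = confidence_guided_merge(x, conf, keep_ratio=0.5, group_size=4)
    print(short.shape, unmerge(short, assign).shape)
```

In this sketch every original token still maps to some row of the shortened sequence, which is one simple way to keep spatial coverage: high-confidence tokens pass through unchanged, and low-confidence regions are represented at coarser resolution rather than dropped.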
