Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, Sebastian Scherer

TL;DR

Co-Me is a training-free acceleration technique for visual geometric transformers: it distills a lightweight per-token confidence predictor and uses the predicted confidence to merge low-confidence tokens. The approach preserves high-confidence, geometrically critical regions while reducing the quadratic attention and MLP workload, achieving up to 11.3x speedup on VGGT and 7.2x on MapAnything with minimal degradation in depth, pose, and point cloud tasks. An efficient CUDA-based implementation with attention bias correction keeps overhead low and enables on-edge deployment, demonstrated by near-real-time performance on edge hardware. The method generalizes across VGGT, StreamVGGT, and MapAnything and is complementary to other acceleration strategies, with potential extensions to streaming time and training-time integration.

Abstract

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merges low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
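
The rank-then-merge idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' CUDA implementation: the function names `bottom_p_mask`, `merge`, and `split`, the fixed `group` size, and the choice to group low-confidence tokens in index order are all assumptions made for illustration.

```python
import numpy as np

def bottom_p_mask(confidence, p):
    """Boolean mask marking the floor(p * N) lowest-confidence tokens."""
    n_merge = int(p * confidence.shape[0])
    order = np.argsort(confidence, kind="stable")  # ascending confidence
    mask = np.zeros(confidence.shape[0], dtype=bool)
    mask[order[:n_merge]] = True
    return mask

def merge(tokens, mask, group=2):
    """Average masked tokens in consecutive groups; keep the rest unchanged.

    Returns the shortened token array and an index map that records, for
    every original token, which merged slot it was written to.
    (Grouping by index order is an assumption, not the paper's rule.)
    """
    kept = np.flatnonzero(~mask)
    low = np.flatnonzero(mask)
    index_map = np.empty(tokens.shape[0], dtype=np.int64)
    index_map[kept] = np.arange(kept.size)
    index_map[low] = kept.size + np.arange(low.size) // group
    n_out = kept.size + -(-low.size // group)  # ceil division
    sums = np.zeros((n_out, tokens.shape[1]))
    np.add.at(sums, index_map, tokens)          # scatter-add per slot
    counts = np.bincount(index_map, minlength=n_out)
    return sums / counts[:, None], index_map

def split(merged, index_map):
    """Restore the original sequence length by copying each merged token."""
    return merged[index_map]

# Toy example: 8 tokens of dimension 2, merge the bottom 50% by confidence.
confidence = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
tokens = np.arange(16, dtype=float).reshape(8, 2)
mask = bottom_p_mask(confidence, p=0.5)
merged, index_map = merge(tokens, mask, group=2)
restored = split(merged, index_map)
```

In this toy run the sequence shrinks from 8 to 6 tokens before the (hypothetical) attention and MLP blocks, and `split` restores the original length afterwards; high-confidence tokens pass through unchanged.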

Paper Structure

This paper contains 24 sections, 8 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Co-Me accelerates visual geometric transformers by selectively merging low-confidence tokens guided by a distilled confidence predictor. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup without retraining or architectural changes to the ViT backbone, turning geometric transformers into real-time-capable models for 3D perception.
  • Figure 2: Overview of Co-Me. A lightweight module distilled from the frozen ViT backbone predicts per-token confidence from intermediate features. The predicted confidence is converted into a binary mask that guides token merging on the attention and MLP modules.
  • Figure 3: The proposed mask generation (left), merge (middle), and split (right) operators. Each sample generates an individual merge mask via confidence ranking and bottom-$p$ selection. A shared index map is used by merging and splitting, which aggregate (average or copy) and restore image tokens while preserving special tokens. Our custom CUDA kernel implementation supports merging masks with different shapes across samples in the batch as long as the number of merged tokens remains consistent.
  • Figure 4: Effect of attention bias correction. Without correction, merging tokens distorts the weight distribution after the softmax operator (left). Adding a bias term $\log n$ aligns the merged attention distribution with the original distribution (right).
  • Figure 5: Acceleration ratio of Co-Me-accelerated VGGT across sequence lengths. The speedup increases with sequence length and reaches up to $26.65\times$ when using a higher merge ratio $p=0.9$.
  • ...and 11 more figures
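
The $\log n$ bias in the Figure 4 caption can be checked numerically. The sketch below assumes the idealized case where the $n$ merged keys (and their values) are exactly identical, so that $e^{s + \log n} = n\,e^{s}$ restores the softmax mass the $n$ copies held before merging. With real tokens that are averaged rather than identical, the correction is only approximate. All variable names are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n = 8, 4                       # head dimension, number of merged tokens
q = rng.normal(size=d)            # a single query
k_keep = rng.normal(size=(3, d))  # keys that are not merged
v_keep = rng.normal(size=(3, d))
k_dup = rng.normal(size=d)        # key shared by the n tokens to be merged
v_dup = rng.normal(size=d)

# Reference: attention over the full sequence with n identical copies.
K_full = np.vstack([k_keep, np.tile(k_dup, (n, 1))])
V_full = np.vstack([v_keep, np.tile(v_dup, (n, 1))])
out_full = softmax(K_full @ q / np.sqrt(d)) @ V_full

# Merged sequence: the n copies collapse into a single token.
K_m = np.vstack([k_keep, k_dup])
V_m = np.vstack([v_keep, v_dup])
logits = K_m @ q / np.sqrt(d)

# Without correction the merged token is under-weighted.
out_naive = softmax(logits) @ V_m

# With correction: add log(n) to the logit of the merged token.
bias = np.zeros(K_m.shape[0])
bias[-1] = np.log(n)
out_corrected = softmax(logits + bias) @ V_m

err_naive = np.abs(out_full - out_naive).max()
err_corrected = np.abs(out_full - out_corrected).max()
```

Under these assumptions `err_corrected` is at floating-point noise level while `err_naive` is not, which matches the qualitative behavior described for the left and right panels.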