Similarity-Aware Token Pruning: Your VLM but Faster
Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, Babak Taati
TL;DR
SAINT is a training-free, similarity-aware token pruning framework that targets the quadratic cost of self-attention in ViTs and VLMs. It models tokens as a graph and prunes redundant ones using a dynamic, voting-driven pruning rate and a redundancy-based ranking; pruning in early layers yields the largest gains. On ImageNet-1K, SAINT achieves state-of-the-art accuracy-throughput trade-offs for ViTs. In VLMs, it reduces latency in ViT-only (text-agnostic), LLM-only, and hybrid configurations, notably dropping 75% of LLaVA-13B's tokens with less than 1% performance loss. The result is a unified, practical framework for efficient inference in both ViTs and VLMs, and it suggests that early, similarity-driven pruning can outperform attention-based or fixed-rate methods.
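To make the graph-and-voting idea concrete, here is a minimal NumPy sketch of similarity-aware pruning for a single layer. It is an illustration under stated assumptions, not SAINT's actual procedure: the function name `prune_similar_tokens`, the threshold `tau`, the cap `max_rate`, the specific voting rule (share of tokens with at least one similar neighbor), and the greedy redundancy ranking are all hypothetical choices made for this example.

```python
import numpy as np

def prune_similar_tokens(tokens, tau=0.9, max_rate=0.5):
    """Return sorted indices of tokens to keep.

    Hypothetical sketch: tau, max_rate, the voting rule, and the greedy
    ranking are illustrative assumptions, not the paper's exact method.
    Assumes tau > 0 and that no token is the zero vector.
    """
    n = tokens.shape[0]

    # Cosine similarity between every pair of tokens.
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token is never its own neighbor

    # Graph: an edge links two tokens whose similarity exceeds tau.
    adj = sim > tau

    # Voting: every token with at least one similar neighbor votes to prune.
    # The vote share, capped at max_rate, sets this layer's pruning budget.
    rate = min(float(adj.any(axis=1).mean()), max_rate)
    budget = int(rate * n)

    # Ranking: repeatedly drop the token with the largest summed similarity
    # to its remaining neighbors, then remove its edges so the last member
    # of a cluster of near-duplicates is not dropped as well.
    weights = np.where(adj, sim, 0.0)
    keep_mask = np.ones(n, dtype=bool)
    for _ in range(budget):
        redundancy = weights.sum(axis=1)
        i = int(np.argmax(redundancy))
        if redundancy[i] <= 0.0:  # no redundant tokens remain
            break
        keep_mask[i] = False
        weights[i, :] = 0.0
        weights[:, i] = 0.0
    return np.flatnonzero(keep_mask)
```

Calling `prune_similar_tokens(x)` on an `(N, d)` array of token embeddings returns the indices to retain, so `x[keep]` is the pruned sequence. Because the budget comes from the vote share rather than a fixed constant, a layer whose tokens are all dissimilar prunes nothing, which is the sense in which the rate is dynamic in this sketch.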
Abstract
The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.
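As a rough, back-of-the-envelope illustration of why pruning pays off under quadratic self-attention (this is not a result from the paper), suppose a layer processes $n$ tokens of width $d$ and a fraction $\rho$ of them is kept. The symbols $C_{\text{attn}}$, $C_{\text{lin}}$, and $\rho$ are notation introduced here for the example.

```latex
\[
C_{\text{attn}}(n) \propto n^{2} d
\quad\Longrightarrow\quad
\frac{C_{\text{attn}}(\rho n)}{C_{\text{attn}}(n)} = \rho^{2},
\qquad
C_{\text{lin}}(n) \propto n d^{2}
\quad\Longrightarrow\quad
\frac{C_{\text{lin}}(\rho n)}{C_{\text{lin}}(n)} = \rho .
\]
\[
\rho = 0.25
\;\Rightarrow\;
\rho^{2} = \tfrac{1}{16} = 0.0625,
\qquad
\rho = \tfrac{1}{4} = 0.25 .
\]
```

Under these assumptions, dropping 75% of the tokens cuts the attention-score cost of each subsequent layer to about one sixteenth and the projection and MLP cost to one quarter. The estimate applies only to layers after the pruning point and only to the pruned token stream; unpruned text tokens in a VLM, memory traffic, and kernel overheads mean measured latency gains will generally be smaller.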
