Similarity-Aware Token Pruning: Your VLM but Faster

Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, Babak Taati

TL;DR

SAINT tackles the heavy computational burden of self-attention in ViTs and VLMs by introducing a training-free, similarity-aware token pruning framework. It models tokens as a graph, pruning redundant tokens based on a dynamic, voting-driven pruning rate and a redundancy-based ranking, with early-layer pruning yielding the largest gains. Across ViTs, SAINT achieves state-of-the-art accuracy-throughput trade-offs on ImageNet-1K, while in VLMs it delivers substantial latency reductions in text-agnostic, LLM-only, and hybrid configurations, notably dropping 75% of tokens in LLaVA-13B with less than 1% accuracy loss. The approach provides a unified, practical framework for efficient inference in both ViTs and VLMs, highlighting that early, similarity-driven pruning can outperform attention-focused or fixed-rate methods in real-world settings.

Abstract

The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.
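
As a back-of-the-envelope illustration of the quadratic-cost argument, the sketch below counts the query-key pairs scored by one self-attention layer before and after pruning. The 256-token figure follows from ViT-H/14 at 224px ((224/14)^2 patches, class token ignored). Reusing the 75% drop quoted for LLaVA-13B on this token count is an assumption made purely for illustration, and the helper names are hypothetical.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs scored by one self-attention layer (n^2)."""
    return num_tokens * num_tokens


def tokens_after_pruning(num_tokens: int, prune_fraction: float) -> int:
    """Tokens kept after dropping `prune_fraction` of them (rounded down)."""
    return int(num_tokens * (1.0 - prune_fraction))


# ViT-H/14 at 224px: (224 / 14)^2 = 256 patch tokens (class token ignored).
full = (224 // 14) ** 2

# Assumption for illustration: apply the 75% token drop to this token count.
kept = tokens_after_pruning(full, 0.75)

# Ratio of attention pair counts before vs. after pruning.
ratio = attention_pair_count(full) / attention_pair_count(kept)
print(full, kept, ratio)  # prints: 256 64 16.0
```

The 16x figure covers only the pairwise attention term. Projections and feed-forward layers scale linearly with the token count, and pruning is applied progressively across layers, so measured end-to-end speedups (such as the 2x throughput reported for ViT-H/14) are much smaller than this ratio.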

Paper Structure

This paper contains 31 sections, 9 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Token dynamics for different vision transformer models. The figure presents 12 panels arranged in a 3 (model class: ViT, i.e. trained on ImageNet; DeiT (Touvron et al., 2021); MAE (He et al., 2022)) by 4 (metrics: $\mathcal{S}$, $\mathcal{H}_{CLS}$, $\mathcal{H}_{Tokens}$, $\bar{A}_{CLS}$) grid. Each panel shows the corresponding metric plotted against the normalized depth, with curves representing different model sizes.
  • Figure 2: Impact of 4 pruning methods on model accuracy for 4 models on ImageNet. The subplots are arranged in 3 rows corresponding to different pruning strategies: (i) the top row prunes a large amount (r = 40%) from only one layer, (ii) the middle row prunes a constant per-layer amount (r = (100/depth)%) up to a specified layer, and (iii) the bottom row employs our voting-based method to determine the prune rate for the first half of the network layers. The baseline (no pruning) is included for reference.
  • Figure 3: Overview of SAINT: Inserted between attention and feed-forward blocks, SAINT models tokens as nodes in a bipartite graph with edges weighted by cosine similarity. It prunes redundant tokens via thresholding, batch-level voting, and redundancy-based ranking. SAINT integrates into transformer layers in vision encoders, LLMs, or both.
  • Figure 4: Accuracy/throughput trade-off for SAINT and baseline methods across various ViT paradigms.
  • Figure 5: The effect of text-agnostic pruning versus VisionZip on performance across three benchmarks using the LLaVA-1.5-7B model, plotted against the number of remaining tokens.
  • ...and 3 more figures
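
The pruning step described in the Figure 3 caption (bipartite graph, cosine-similarity edges, thresholding, batch-level voting, redundancy-based ranking) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions not stated in the text above: the even/odd split into two partitions, the threshold `tau = 0.9`, the use of the median as the voting rule, and the use of a token's maximum similarity as its redundancy score. All function names are hypothetical.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two token vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def redundancy_scores(tokens):
    """Assumed bipartite split: score each even-indexed token by its
    maximum cosine similarity to any odd-indexed token."""
    src = range(0, len(tokens), 2)
    dst = range(1, len(tokens), 2)
    return {i: max(cosine(tokens[i], tokens[j]) for j in dst) for i in src}


def prune_batch(batch, tau=0.9):
    """Drop the k most redundant tokens per image, where k is voted on
    across the batch so every image keeps the same number of tokens."""
    scores = [redundancy_scores(tokens) for tokens in batch]
    # Thresholding: each image votes with its count of edges above tau.
    votes = [sum(s > tau for s in per_image.values()) for per_image in scores]
    # Assumed voting rule: the batch-wide prune count is the median vote.
    k = int(median(votes))
    pruned = []
    for tokens, per_image in zip(batch, scores):
        # Ranking: the k highest-scoring (most redundant) tokens are dropped.
        drop = set(sorted(per_image, key=per_image.get, reverse=True)[:k])
        pruned.append([t for i, t in enumerate(tokens) if i not in drop])
    return pruned, k


# Toy batch: 3 "images", 4 two-dimensional tokens each.
batch = [
    [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, -1.0]],  # token 0 nearly equals token 1
    [[1.0, 0.0], [0.99, 0.0], [0.0, 1.0], [1.0, 0.0]],   # token 0 parallel to tokens 1 and 3
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],  # no redundant pair
]
pruned, k = prune_batch(batch)
print(k, [len(p) for p in pruned])  # prints: 1 [3, 3, 3]
```

In the toy batch, the third image votes for zero pruning but still loses one token, because the batch-level count keeps all sequences the same length. That uniformity is what lets the pruned batch remain a dense tensor in a real model.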