SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers

Oded Schlesinger, Amirhossein Farzam, J. Matias Di Martino, Guillermo Sapiro

TL;DR

This work tackles the high computational cost of Vision Transformers by introducing SPOT, a modular token-sparsification framework that leverages token embeddings and cross-layer attention dynamics to predict token relevance. A lightweight Token Relevance Module ingests per-token features, including compact attention statistics across layers, and uses an MLP to produce a soft and then hard masking of tokens via differentiable Gumbel-Softmax sampling, guided by a multi-iteration retention schedule $\rho_k=\rho^k$. Empirical results on ImageNet-1K with DeiT and LV-ViT show up to 40% GFLOPS savings while preserving or slightly improving accuracy, with strong robustness to perturbations and good cross-domain transfer. The approach is compatible with hard and soft sparsification and demonstrates interpretability through attention-based token pruning aligned with semantic content, offering a practical path to efficient ViT deployment.
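The retention schedule and the soft-then-hard masking step can be illustrated with a small sketch. This is not the authors' implementation: the function names, the rounding of token counts, the two-logit (drop, keep) layout, and the temperature `tau` are all assumptions made for illustration, and only the forward pass is shown (no gradients).

```python
import math
import random


def retention_schedule(rho, num_stages):
    """Cumulative keep fraction after stage k, following rho_k = rho ** k."""
    return [rho ** k for k in range(1, num_stages + 1)]


def tokens_kept(num_tokens, rho, num_stages):
    """Token budget per stage. Rounding to the nearest integer is an assumption."""
    return [round(num_tokens * r) for r in retention_schedule(rho, num_stages)]


def _gumbel(rng):
    """Draw Gumbel(0, 1) noise as -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax_keep(logits, tau=1.0, rng=random):
    """Forward pass of a two-class Gumbel-Softmax per token.

    logits: list of (drop_logit, keep_logit) pairs, e.g. the output of a
            relevance MLP (hypothetical layout).
    Returns (soft, hard): soft keep probabilities and the 0/1 mask obtained
    by taking the more likely class. In a training framework the hard mask
    would typically be paired with a straight-through gradient; that part
    is omitted here.
    """
    soft, hard = [], []
    for drop_logit, keep_logit in logits:
        z_drop = (drop_logit + _gumbel(rng)) / tau
        z_keep = (keep_logit + _gumbel(rng)) / tau
        m = max(z_drop, z_keep)  # subtract the max for numerical stability
        e_drop = math.exp(z_drop - m)
        e_keep = math.exp(z_keep - m)
        p_keep = e_keep / (e_drop + e_keep)
        soft.append(p_keep)
        hard.append(1 if p_keep >= 0.5 else 0)
    return soft, hard
```

For example, under these assumptions `tokens_kept(196, 0.7, 3)` gives budgets of 137, 96, and 67 tokens for three successive stages, since the keep fraction compounds as 0.7, 0.49, 0.343.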

Abstract

While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT .
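To make the quadratic scaling concrete, the sketch below uses a common back-of-the-envelope count of multiply-accumulates (MACs) for one transformer block. It is a rough estimate rather than the paper's accounting: it assumes a standard block with an MLP expansion ratio of 4 and ignores softmax, normalization, and biases, and the function names are hypothetical.

```python
def attention_macs(num_tokens, dim):
    """MACs for the token-token part of self-attention.

    Q @ K^T costs N^2 * D and attention @ V costs another N^2 * D,
    so this term grows quadratically with the number of tokens N.
    """
    return 2 * num_tokens ** 2 * dim


def linear_macs(num_tokens, dim, mlp_ratio=4):
    """MACs for the per-token linear layers (grows linearly with N).

    Q, K, V, and output projections: 4 * N * D^2.
    Two MLP layers with expansion ratio r: 2 * r * N * D^2.
    """
    return (4 + 2 * mlp_ratio) * num_tokens * dim ** 2


def block_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate MACs for one transformer block."""
    return attention_macs(num_tokens, dim) + linear_macs(num_tokens, dim, mlp_ratio)


def relative_cost(keep_fraction, num_tokens, dim):
    """Cost of a block on the kept tokens relative to the full token set."""
    kept = round(num_tokens * keep_fraction)  # rounding is an assumption
    return block_macs(kept, dim) / block_macs(num_tokens, dim)
```

Halving the token count cuts the attention term to a quarter and the linear term to a half, which is why removing tokens before attention is computed reduces cost in every subsequent block.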

Paper Structure

This paper contains 22 sections, 20 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of the proposed SPOT framework. Top: current sparsification methods primarily rely on partial information from the current model layer—either token embeddings or attention maps—to determine token importance, potentially leading to suboptimal predictions. Middle: SPOT enhances this process by integrating compact representations of both token embeddings and attention maps and their dynamics from multiple ViT blocks (highlighted in green) to SPOT and select the most crucial input-specific information. Providing this multi-layer information facilitates the gradual pruning of less relevant tokens, improving efficiency while maintaining accuracy. Bottom: Example visualizations of the gradual relevance-driven token selection performed by SPOT.
  • Figure 2: Illustration of the proposed redundant-token detection data flow. Our modular design enables effective identification of redundant tokens within ViT-based architectures by plugging the proposed module (blue) into any of the model's self-attention layers (gray). At each stage, the method combines token embeddings with per-head attention map statistics from all layers up to and including the current one, as described in the method section. This provides the SPOT module with rich contextual information and a comprehensive representation of inter-token interactions and their dynamics throughout the model, enabling it to predict redundant tokens effectively.
  • Figure 3: Visualizations of the gradual redundant-token detection performed by our proposed approach with DeiT-S on samples from the ImageNet-1K validation set. Increasingly transparent masking shades indicate later detection. Tokens identified as more informative, and thereby retained, are well aligned with semantic image objects and visual features, pointing to SPOT's interpretability.
  • Figure 4: Performance of SPOT on the ImageNet-1K dataset. We evaluate SPOT classification accuracy across four different models: DeiT-T (top-left), DeiT-S (bottom-left) [touvron2021training], LV-ViT-T (top-right), and LV-ViT-S (bottom-right) [jiang2021all], under varying computational budgets quantified in GFLOPS, corresponding to different retention rates, set by $\rho$. As expected, our framework exhibits a trade-off between efficiency and performance, as higher computational budgets lead to higher accuracy.
  • Figure 5: Reduced token-derived information results.
  • ...and 2 more figures