Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang, Bhishma Dedhia, Niraj K. Jha

TL;DR

This paper tackles the challenge of deploying large Vision Transformers on compute-constrained devices by enabling zero-shot token pruning without fine-tuning. It introduces Zero-TPrune, which leverages the attention graph of pre-trained transformers to compute token importance via Weighted Page Rank and then applies similarity-based pruning guided by an importance-driven partitioning. The method combines an I-stage (importance) with an S-stage (similarity) and uses techniques like Emphasizing Informative Region and Variance-based Head Filter to stabilize rankings, achieving substantial FLOPs reductions and throughput gains with minimal accuracy loss on ImageNet across multiple backbones. The approach outperforms both fine-tuning-required pruning methods and other fine-tuning-free baselines in most settings, highlighting its practical impact for edge-efficient transformer inference.

Abstract

Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However, most token pruning methods require computationally expensive fine-tuning, which is undesirable in many edge deployment cases. In this work, we propose Zero-TPrune, the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. Due to the elimination of the fine-tuning overhead, Zero-TPrune can prune large models at negligible computational cost, switch between different pruning configurations at no computational cost, and perform hyperparameter tuning efficiently. We evaluate the performance of Zero-TPrune on vision tasks by applying it to various vision Transformer backbones and testing them on ImageNet. Without any fine-tuning, Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and improves its throughput by 45.3% with only 0.4% accuracy loss. Compared with state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only eliminates the need for fine-tuning after pruning but also does so with only 0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods, Zero-TPrune reduces accuracy loss by up to 49% with similar FLOPs budgets. Project webpage: https://jha-lab.github.io/zerotprune.
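The abstract describes the Weighted Page Rank step only in words, so the following is a minimal sketch of the idea and not the authors' implementation. It assumes a single attention head whose matrix is row-stochastic (`attn[j][i]` is the attention token `j` pays to token `i`), and it assumes the update rule "a token's new score is the attention it receives, weighted by the sender's current score". The function name, the stopping tolerance, and the example matrix are all hypothetical.

```python
def weighted_page_rank(attn, max_iters=50, tol=1e-6):
    """Propagate an importance signal over an attention graph.

    attn[j][i]: attention token j pays to token i (each row sums to 1).
    The update rule below is an assumption made for illustration.
    """
    n = len(attn)
    scores = [1.0 / n] * n  # uniform initialization
    for _ in range(max_iters):
        # Each token collects votes from all tokens, weighted by voter score.
        new = [sum(attn[j][i] * scores[j] for j in range(n)) for i in range(n)]
        total = sum(new)
        new = [s / total for s in new]  # keep the scores a distribution
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


# Hypothetical 4x4 attention matrix in which token 1 receives most attention.
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.1, 0.2],
]
scores = weighted_page_rank(attn)
most_important = max(range(len(scores)), key=lambda i: scores[i])
```

Under these assumptions the scores form a distribution over tokens, and the lowest-scoring tokens would be the first candidates for pruning. Because nothing is learned, changing the pruning rate only changes how many tokens are dropped, which is consistent with the claim that configurations can be switched at no computational cost.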

Paper Structure

This paper contains 33 sections, 8 equations, 20 figures, 14 tables, and 1 algorithm.

Figures (20)

  • Figure 1: Comparing existing efficiency enhancement methods and Zero-TPrune. $\rho$ represents the retention ratio measured by FLOPs cost. Most existing methods require re-training of the model after deploying it; each different pruning configuration requires separate re-training of the model, which is extremely expensive. In contrast, Zero-TPrune is training-free and can switch between different pruning configurations at no computational cost. This benefit stems from our graph-based algorithm, which exploits correlations between image tokens.
  • Figure 2: The overall Zero-TPrune framework. Pruning layers can be inserted between Transformer blocks to reduce the number of tokens. Each pruning layer comprises an I-stage and an S-stage: the I-stage prunes unimportant tokens of an image, such as background tokens (see (b)); the S-stage prunes tokens that are too similar to others, such as repetitive texture tokens (see (c)). Combining the two stages maximally exploits token redundancy (see (d)).
  • Figure 3: Overview of the I-stage: (a) from a 4$\times$4 attention matrix to an attention graph and (b) graph signal transformation from initialization to convergence.
  • Figure 4: The importance-based pruning process in the S-stage. As an example, sequential partitioning (pruning the unimportant part) is used in this figure.
  • Figure 5: Visualized examples of the pruning process conducted by Zero-TPrune. Images are randomly selected from the ImageNet validation set. When the pruning rate is aggressive and the main object occupies most of the image area, pruning only background tokens is not enough. Zero-TPrune exploits similarity between main object tokens and prunes redundant ones.
  • ...and 15 more figures
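
The captions for Figures 2 and 4 describe an S-stage that removes tokens too similar to others, with the token partition guided by importance. The sketch below is one possible reading of that description, not the paper's algorithm. It assumes that tokens are split into a more important half and a less important half, that similarity is cosine similarity between feature vectors, and that only tokens in the less important half are pruned. The function names, the half-and-half split, and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_prune(features, importance, num_prune):
    """Return indices of kept tokens after pruning redundant low-importance ones."""
    n = len(features)
    if n < 2 or num_prune <= 0:
        return list(range(n))
    # Assumed partition: sort by importance, then split into two halves.
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    anchors = order[: n // 2]      # more important tokens, never pruned here
    candidates = order[n // 2:]    # less important tokens, may be pruned
    # A candidate is redundant if it closely matches some anchor token.
    redundancy = {
        i: max(cosine(features[i], features[k]) for k in anchors)
        for i in candidates
    }
    ranked = sorted(candidates, key=lambda i: redundancy[i], reverse=True)
    pruned = set(ranked[:num_prune])
    return [i for i in range(n) if i not in pruned]


# Toy example: token 3 nearly duplicates token 0, so it is pruned first.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
importance = [0.4, 0.1, 0.3, 0.2]
kept = similarity_prune(features, importance, num_prune=1)
```

Comparing candidates only against the anchor half keeps the number of similarity computations at roughly a quarter of all token pairs, which is one reason a partition-based scheme can be cheaper than an all-pairs comparison.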