PPT: Token Pruning and Pooling for Efficient Vision Transformers

Xinjian Wu; Fanhu Zeng; Xiudong Wang; Xinghao Chen

TL;DR

Vision Transformers incur high computational cost because of dense token interactions. PPT (token Pruning & Pooling Transformers) is a framework with no additional trainable parameters that adaptively switches, across layers and instances, between pruning inattentive tokens and pooling similar tokens. The policy is chosen from the variance of the token significance scores. On ImageNet, PPT reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop, and it extends to other ViT variants. The work shows the value of addressing inattentive and duplicative redundancy jointly to make transformer-based vision models more practical and scalable.

Abstract

Vision Transformers (ViTs) have emerged as powerful models in computer vision, delivering superior performance across various vision tasks. However, their high computational complexity poses a significant barrier to practical use in real-world scenarios. Motivated by the fact that not all tokens contribute equally to the final predictions and that fewer tokens incur less computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers. However, we argue that it is not optimal to reduce only inattentive redundancy by token pruning, or only duplicative redundancy by token merging. To this end, we propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT), to adaptively tackle these two types of redundancy in different layers. By heuristically integrating both token pruning and token pooling techniques in ViTs without additional trainable parameters, PPT effectively reduces model complexity while maintaining predictive accuracy. For example, PPT reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset. The code is available at https://github.com/xjwu1024/PPT and https://github.com/mindspore-lab/models/
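
To make the adaptive choice between pruning and pooling concrete, here is a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the function names (`prune`, `pool`, `compress`), the threshold `tau`, the rule that a high variance of significance scores selects pruning, and the use of greedy pairwise averaging by cosine similarity as a simple stand-in for whatever matching the released code uses. The significance scores are taken as given inputs.

```python
import math
from statistics import pvariance


def cosine(a, b):
    """Cosine similarity between two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune(tokens, scores, r):
    """Drop the r tokens with the lowest significance scores."""
    n_keep = len(tokens) - r
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    keep.sort()  # preserve the original token order
    return [tokens[i] for i in keep], [scores[i] for i in keep]


def pool(tokens, scores, r):
    """Merge the most similar pair of tokens, r times, by averaging.

    Assumption: greedy pairwise averaging is used here only for clarity.
    """
    tokens = [list(t) for t in tokens]
    scores = list(scores)
    for _ in range(r):
        if len(tokens) < 2:
            break
        # Find the pair (i, j), i < j, with the highest cosine similarity.
        _, i, j = max(
            (cosine(tokens[i], tokens[j]), i, j)
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens))
        )
        tokens[i] = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        scores[i] = max(scores[i], scores[j])
        del tokens[j]
        del scores[j]
    return tokens, scores


def compress(tokens, scores, r, tau):
    """Pick a policy from the score variance (hypothetical rule and threshold)."""
    if pvariance(scores) > tau:
        # Scores are spread out: some tokens are clearly inattentive.
        return ("prune",) + prune(tokens, scores, r)
    # Scores are nearly uniform: remove duplicates instead.
    return ("pool",) + pool(tokens, scores, r)


tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]]
print(compress(tokens, [0.9, 0.05, 0.03, 0.02], r=1, tau=0.01)[0])  # spread-out scores
print(compress(tokens, [0.25, 0.25, 0.25, 0.25], r=1, tau=0.01)[0])  # uniform scores
```

With spread-out scores the sketch removes the lowest-scoring token; with uniform scores it averages the two nearly identical tokens instead. Neither branch introduces trainable parameters, which matches the parameter-free claim above, though the real module operates on batched tensors inside a transformer block.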

Paper Structure (16 sections, 3 equations, 15 figures, 7 tables)

Introduction
Related Work
Methodology
Overview
Token Pruning for Inattentive Redundancy
Token Pooling for Duplicative Redundancy
Token Pruning & Pooling Transformer
Experiments
Main Results
Ablation Study
Conclusion
Impact Statements
Deeper Analysis
Full Results
Extend Experiments
...and 1 more sections

Figures (15)

Figure 1: Visualizations of token compression results using different methods on ImageNet with DeiT-S. (a) Original images. (b) Token pruning methods, which discard inattentive tokens. (c) Token pooling methods, which merge similar tokens within the same color bounding box. (d) Our method effectively addresses both types of redundancy while achieving superior performance.
Figure 2: Overview of the proposed PPT approach. (a) The Adaptive Token Compression module is simple and can be easily inserted into the standard transformer block without additional trainable parameters. (b) Our module adaptively selects either the token pruning or the token pooling policy to tackle the corresponding redundancy based on the current token distribution, which is reflected across various instances and layers in (c). (c) With PPT, similar patches within the same color bounding box are pooled into a single token, while the masked inattentive patches are pruned, resulting in promising trade-offs between accuracy and FLOPs.
Figure 2: Comparisons with different variants of ViTs on ImageNet. We use LV-ViT-S (Jiang et al., 2021) as the base model for compression and achieve a promising accuracy-FLOPs trade-off.
Figure 3: The scatter and the histogram of the variance of the significance scores assigned to image tokens at each layer of the DeiT-S model on the ImageNet validation set. The y-axis corresponds to the variance value and the x-axis to the index of samples in the dataset. We display the average variance of each layer at the top of each graph in red to track the trend of variance changes as the layers go deeper.
Figure 4: Comparison between our method and other methods under different FLOPs. We conducted a comprehensive comparison of the performance of various methods after fine-tuning (left) and off-the-shelf (right), which highlights the superiority of our method.
...and 10 more figures
