Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua

TL;DR

PM-ViT presents a hardware-friendly approach to compressing Vision Transformer tokens by jointly pruning and merging within each transformer block. The method introduces a Prune and Merge module with a learnable merge matrix and a reconstruct matrix, plus shortcut connections to preserve information from pruned tokens, enabling layer-wise compression with minimal overhead. A gradient-weighted attention scoring mechanism derives token importance during training, eliminating the need for separate inference-time scoring and guiding the construction of the merge and reconstruction matrices. Additionally, a global compression strategy uses gradient information to identify near-winning-ticket structures, followed by finetuning with self-distillation to recover accuracy. Experiments on ImageNet-1k and ADE20K demonstrate substantial speed-ups (e.g., up to 1.64× on DeiT-Small) with negligible accuracy loss and robust performance across classification and semantic segmentation tasks, with the code and models publicly available.

Abstract

Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64× speed-up with only a 0.2% drop in accuracy on ImageNet-1k. Moreover, by compressing Segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge.

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, 12 tables, 2 algorithms.

Figures (7)

  • Figure 1: Our paper compares three token compression paradigms: (a) the regular token pruning, (b) the regular token merging operation, and (c) our Prune and Merge compression method. In contrast to (a) and (b), our proposed method seamlessly integrates token pruning and merging operations. Additionally, we reconstruct tokens in the transformer block output using a reconstruct matrix (labeled as "Reco Matrix" in the figure) and shortcut connections, enabling efficient layer-wise compression.
  • Figure 2: The architecture of our proposed approach. The left part illustrates the calculation process of the importance scores. We utilize the attention map $\mathbf{A} = \mathrm{Softmax}(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{d}})$, which reflects the model's focus on different regions of the input, and its corresponding gradient map $\mathbf{G} = \frac{\partial \mathcal{L}}{\partial \mathbf{A}}$, which shows the impact of each pixel on the prediction, to obtain the gradient-weighted map. By summing along the Key dimension, we derive the importance scores. Based on the importance scores, we generate the token mask by applying a threshold. The token mask allows us to zero out the scores of pruned tokens. Moreover, employing specific algorithms, we derive both the merge matrix and the reconstruct matrix using the importance scores. The right part of the figure showcases the structural overview of our prune and merge method. Initially, the input tokens undergo matrix multiplication with the merge matrix, resulting in compressed tokens. These compressed tokens are then fed into the transformer block. To restore the spatial resolution, the output of the block is multiplied by the reconstruct matrix. The pruned tokens are preserved using the negation of the token mask. By incorporating shortcut connections, the pruned tokens are added to the output results, ensuring the retention of feature information.
  • Figure 3: Grad-CAM visualization of ViT-Base's attention maps at different layers. The attention paid to the input image varies from layer to layer.
  • Figure 4: Segmenter + ViT-L compression results at various compression rates on ADE20K. We compare our method with state-of-the-art methods. Left: mIoU-GFLOPs curve. Right: mIoU-FPS curve.
  • Figure 5: Visualization of segmentation results on the ADE20K dataset at a 50% token compression rate, comparing our PM-ViT with other SOTA approaches.
  • ...and 2 more figures
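
The pipeline described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the gradient map `grad_a` is passed in as a placeholder for $\mathbf{G}$ (in training it would come from backpropagation), "summing along the Key dimension" is read here as reducing over the key axis to give one score per query token, the median threshold is arbitrary, and the grouping rule in `build_matrices` (averaging consecutive kept tokens) is a hypothetical stand-in for the "specific algorithms" the caption mentions. All function names are invented for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def importance_scores(q, k, grad_a):
    # A = Softmax(Q K^T / sqrt(d)); grad_a stands in for G = dL/dA.
    d = q.shape[-1]
    a = softmax(q @ k.T / np.sqrt(d), axis=-1)
    # Gradient-weighted map, reduced over the key axis (assumed reading).
    return (a * grad_a).sum(axis=-1)


def build_matrices(mask, group_size=2):
    # Hypothetical grouping rule: consecutive kept tokens are averaged in
    # groups of `group_size`. The paper's own construction is not shown here.
    kept = np.flatnonzero(mask)
    groups = [kept[i:i + group_size] for i in range(0, len(kept), group_size)]
    n_tokens = mask.shape[0]
    merge = np.zeros((len(groups), n_tokens))
    reconstruct = np.zeros((n_tokens, len(groups)))
    for g, idx in enumerate(groups):
        merge[g, idx] = 1.0 / len(idx)  # average the members of the group
        reconstruct[idx, g] = 1.0       # copy the group output back to members
    return merge, reconstruct


def prune_and_merge_block(x, mask, block, group_size=2):
    merge, reconstruct = build_matrices(mask, group_size)
    compressed = merge @ x                      # (n_groups, dim)
    restored = reconstruct @ block(compressed)  # (n_tokens, dim)
    shortcut = (~mask)[:, None] * x             # pruned tokens bypass the block
    return restored + shortcut


rng = np.random.default_rng(0)
n_tokens, dim = 8, 4
x = rng.normal(size=(n_tokens, dim))
q, k = rng.normal(size=(2, n_tokens, dim))
grad_a = rng.normal(size=(n_tokens, n_tokens))  # placeholder for dL/dA

scores = importance_scores(q, k, grad_a)
mask = scores >= np.median(scores)              # hypothetical threshold
out = prune_and_merge_block(x, mask, block=lambda t: t)  # identity stand-in
print(out.shape)  # (8, 4)
```

With the identity function standing in for the transformer block, each kept token comes back as the mean of its group and each pruned token comes back unchanged through the shortcut, so the output keeps the full token grid. Preserving that grid is what makes the scheme usable for dense tasks such as segmentation.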