Rényi Attention Entropy for Patch Pruning

Hiroaki Aizawa, Yuki Igaue

Abstract

Transformers are strong baselines in both vision and language because self-attention captures long-range dependencies across tokens. However, the cost of self-attention grows quadratically with the number of tokens. Patch pruning mitigates this cost by estimating per-patch importance and removing redundant patches. To identify informative patches for pruning, we introduce a criterion based on the Shannon entropy of the attention distribution: low-entropy patches, which receive selective and concentrated attention, are kept as important, while high-entropy patches, whose attention is spread across many locations, are treated as redundant. We also extend the criterion from Shannon to Rényi entropy, whose order controls how strongly sharp attention peaks are emphasized and thus supports pruning strategies that adapt to task needs and computational limits. In experiments on fine-grained image recognition, where patch selection is critical, our method reduces computation while preserving accuracy. Moreover, adjusting the pruning policy via the Rényi order yields further gains and improves the trade-off between accuracy and computation.
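
As a concrete illustration of the criterion, the sketch below computes per-patch Shannon and Rényi attention entropy and keeps the lowest-entropy patches. This is a minimal PyTorch sketch, not the authors' released code: the function names, the averaging over heads, the use of attention rows (rather than received or CLS attention), and always retaining the CLS token are our assumptions.

    import torch

    def renyi_attention_entropy(attn, alpha=2.0, eps=1e-12):
        # attn: (batch, heads, tokens, tokens) softmax attention weights,
        # where each row attn[b, h, i, :] sums to 1 over the key tokens.
        # Returns per-token entropies of shape (batch, tokens), averaged over heads.
        if abs(alpha - 1.0) < 1e-6:
            # Shannon limit: H(p) = -sum_j p_j log p_j
            h = -(attn * (attn + eps).log()).sum(dim=-1)
        else:
            # Rényi entropy: H_alpha(p) = log(sum_j p_j^alpha) / (1 - alpha)
            h = (attn.clamp_min(eps) ** alpha).sum(dim=-1).log() / (1.0 - alpha)
        return h.mean(dim=1)  # average over attention heads (our assumption)

    def keep_low_entropy_patches(tokens, attn, keep_rate=0.7, alpha=2.0):
        # tokens: (batch, tokens, dim) with the CLS token at index 0.
        # Keeps the keep_rate fraction of patch tokens with the LOWEST entropy
        # (most concentrated attention); the CLS token is always retained.
        b, n, d = tokens.shape
        ent = renyi_attention_entropy(attn, alpha)[:, 1:]      # patch tokens only
        k = max(1, int(keep_rate * (n - 1)))
        idx = ent.topk(k, dim=-1, largest=False).indices + 1   # +1 skips CLS
        cls = torch.zeros(b, 1, dtype=idx.dtype, device=idx.device)
        idx = torch.cat([cls, idx], dim=1)
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

For $\alpha > 1$ the sum $\sum_j a_{ij}^{\alpha}$ is dominated by the largest attention weights, which is what gives the Rényi order its peak-emphasizing behavior.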

Paper Structure

This paper contains 24 sections, 7 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Overview of our key idea. The left figure shows an attention entropy map, where red indicates higher entropy and blue indicates lower entropy. We observe that low attention entropy corresponds to foreground regions and high attention entropy to background, which helps identify informative patches. Based on this, we use attention entropy as the pruning criterion, as illustrated on the right.
  • Figure 2: Overall pipeline of the Rényi attention entropy pruning. The pruning procedure is described in Section \ref{ssec:patch_pruning}.
  • Figure 3: Visualization of Rényi attention entropy ($\alpha=2.0$) for each Transformer block in DeiT-S. This visualization shows that attention entropy depends on Transformer layer depth, and lower entropy corresponds to foreground regions.
  • Figure 4: Visualizations of patch pruning results for EViT and the Rényi attention entropy-based approach on ImageNet-100, FGVC Aircraft, and Oxford Flowers102. From left to right: input image, pruning results at Blocks 4, 7, and 10. The keep rate is $r=0.7$, and for our method we show the results with the tuned $\alpha$ (a code sketch of this per-block schedule follows this list).
  • Figure 5: Visualization of Shannon and Rényi attention entropies. For each DeiT-S block, the figure shows histograms of Shannon attention entropy ($\alpha=1.0$) and Rényi attention entropy at different $\alpha$ orders. Blue indicates informative patches that are kept, and red indicates redundant patches that are pruned. The results show that the Rényi order controls peak emphasis, allowing the characterization of the attention distribution to be adapted to the task.
  • ...and 1 more figure
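
For concreteness, the per-block schedule that Figure 4 visualizes (pruning at Blocks 4, 7, and 10 with keep rate $r=0.7$) might be wired up as follows. This is a hypothetical usage of the sketch above, not the authors' pipeline; in particular, it assumes each Transformer block returns its softmax attention map alongside the tokens.

    PRUNE_BLOCKS = {4, 7, 10}  # blocks after which pruning is applied (per Figure 4)

    def forward_with_pruning(blocks, tokens, keep_rate=0.7, alpha=2.0):
        # blocks: iterable of Transformer blocks, each returning (tokens, attn).
        for i, block in enumerate(blocks, start=1):
            tokens, attn = block(tokens)
            if i in PRUNE_BLOCKS:
                tokens = keep_low_entropy_patches(tokens, attn, keep_rate, alpha)
        return tokens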

Theorems & Definitions (3)

  • Definition 1: Patch attention distribution
  • Definition 2: Shannon attention entropy
  • Definition 3: Rényi attention entropy (standard forms of all three are sketched below)
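
The list above names the paper's core quantities; the standard forms they most likely instantiate are sketched below. This is a hedged reconstruction, not the paper's verbatim equations: $a_{ij}$ denotes the softmax attention weight from patch $i$ to patch $j$ over $N$ tokens, and the row convention is our assumption.

% Assumed standard forms; a_{ij} is the softmax attention weight from patch i to patch j.
\begin{align}
  p_i &= (a_{i1}, \ldots, a_{iN}), \qquad \sum_{j=1}^{N} a_{ij} = 1
      && \text{(patch attention distribution)} \\
  H(p_i) &= -\sum_{j=1}^{N} a_{ij} \log a_{ij}
      && \text{(Shannon attention entropy)} \\
  H_{\alpha}(p_i) &= \frac{1}{1-\alpha} \log \sum_{j=1}^{N} a_{ij}^{\alpha},
      \qquad \alpha > 0,\ \alpha \neq 1
      && \text{(Rényi attention entropy)}
\end{align}

In the limit $\alpha \to 1$, $H_{\alpha}$ recovers the Shannon entropy, so the Rényi criterion strictly generalizes the Shannon one; for $\alpha > 1$ the sum is dominated by the largest weights $a_{ij}$, which is why larger orders emphasize sharp attention peaks, consistent with the behavior described for Figure 5.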