Rényi Entropy: A New Token Pruning Metric for Vision Transformers

Wei-Yuan Su, Ruijie Zhang, Zheng Zhang

Abstract

Vision Transformers (ViTs) achieve state-of-the-art performance but suffer from the $O(N^2)$ complexity of self-attention, making inference costly for high-resolution inputs. To address this bottleneck, token pruning has emerged as a critical technique to accelerate inference. Most existing methods rely on the [CLS] token to estimate patch importance. However, we argue that the [CLS] token can be unreliable in early layers, where semantic representations are still immature. As a result, pruning in early layers often leads to inaccurate importance estimation and unnecessary information loss. In this work, we propose a training-free token importance metric, namely Col-Ln, derived from Rényi entropy, which identifies informative tokens from the very first layer of the network and thereby enables more reliable token reduction. Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) demonstrate that our approach consistently outperforms state-of-the-art pruning methods across diverse benchmarks.
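The precise Col-Ln definition appears later in the paper; as a minimal sketch, assuming Col-Ln is the column-wise $\ell_n$-norm of a layer's post-softmax attention map (the reading suggested by Figure 4), token scoring and pruning could look like the following. The function names (`col_ln_scores`, `prune_tokens`), the single-image tensor shapes, and the top-k selection are illustrative assumptions, not the paper's implementation.

```python
import torch

def col_ln_scores(attn: torch.Tensor, n: float = 2.0) -> torch.Tensor:
    """Column-wise l_n-norm scores for one layer's attention map.

    attn: (heads, N, N) post-softmax attention for a single image.
    Returns one importance score per token, averaged over heads.
    """
    # Column j holds the attention that every query pays to token j.
    # Its l_n norm is large when that mass is concentrated on few queries,
    # which (up to normalization and a monotone transform) corresponds to
    # a low Renyi entropy of order n.
    scores = torch.linalg.vector_norm(attn, ord=n, dim=1)  # (heads, N)
    return scores.mean(dim=0)                              # (N,)

def prune_tokens(x: torch.Tensor, attn: torch.Tensor,
                 keep: int, n: float = 2.0) -> torch.Tensor:
    """Keep the [CLS] token plus the `keep` highest-scoring patch tokens.

    x: (1, N, D) token embeddings; attn: (heads, N, N); token 0 is [CLS].
    """
    patch_scores = col_ln_scores(attn, n)[1:]      # skip the [CLS] column
    top = patch_scores.topk(keep).indices + 1      # shift back past [CLS]
    idx = torch.cat([top.new_zeros(1), top.sort().values])
    return x[:, idx]                               # (1, keep + 1, D)
```

Because the scores depend only on the attention map of the current layer, this kind of metric can be applied from the first block onward, without waiting for the [CLS] token to mature.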


Paper Structure

This paper contains 27 sections, 3 equations, 4 figures, 17 tables, and 2 algorithms.

Figures (4)

  • Figure 1: Visual Comparison of Token Pruning. We compare standard [CLS] attention and our proposed metric in the pruning process across initial layers ($L_0$--$L_5$). While [CLS] suffers from semantic immaturity and erroneously prunes critical foreground patches, our method preserves significantly more of the main subject while accurately eliminating redundant background tokens.
  • Figure 2: Training Loss Comparison on the EViT Framework.
  • Figure 3: Heatmap-based Attention Visualization. This figure provides the attention heatmaps for the samples shown in Fig. 1 of the main paper. We compare the standard [CLS] attention (top rows) with our proposed Col-Ln metric (bottom rows) across the initial six layers ($L_0$--$L_5$) of a ViT-Base model. Red colors indicate higher importance scores.
  • Figure 4: Layer-wise attention visualization on ViT-Base with different norm orders. The top row displays traditional [CLS] attention, which is noisy and diffuse in early layers ($L_0$--$L_3$). In contrast, the bottom rows show our column-wise $\ell_n$-norms. As the norm order $n$ increases (from $\ell_1$ to $\ell_4$), the heatmaps become progressively sharper and more concentrated on the salient object (a minimal visualization sketch follows this list).
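To reproduce the layer-by-norm heatmap layout of Figures 3 and 4, per-token scores can be reshaped back onto the patch grid. This is a hedged sketch, not the authors' plotting code: `heatmap_grid`, the 14x14 patch grid (ViT-Base at 224x224 input with 16x16 patches), and the norm orders $\ell_1$--$\ell_4$ are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import torch

def heatmap_grid(attns, norm_orders=(1.0, 2.0, 3.0, 4.0), grid=14):
    """Layer-by-norm heatmap grid in the style of Figure 4.

    attns: per-layer attention maps, each (heads, N, N) with N = grid**2 + 1.
    Rows sweep the norm order n; columns sweep layers L0, L1, ...
    """
    fig, axes = plt.subplots(len(norm_orders), len(attns),
                             figsize=(2 * len(attns), 2 * len(norm_orders)),
                             squeeze=False)
    for r, n in enumerate(norm_orders):
        for c, attn in enumerate(attns):
            # Column-wise l_n norm over queries, head-averaged; drop [CLS].
            s = torch.linalg.vector_norm(attn, ord=n, dim=1).mean(0)[1:]
            axes[r][c].imshow(s.reshape(grid, grid).cpu().numpy(),
                              cmap="Reds")  # red = higher importance
            axes[r][c].set_axis_off()
            if r == 0:
                axes[r][c].set_title(f"$L_{c}$")
    return fig
```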