Token Cropr: Faster ViTs for Quite a Few Tasks

Benjamin Bergner, Christoph Lippert, Aravindh Mahendran

TL;DR

This paper introduces Cropr, a token pruning framework for Vision Transformers (ViTs) that learns per-token task relevance using cross-attention-based routing and auxiliary prediction heads. The auxiliary heads are discarded after training, and Last Layer Fusion (LLF) reactivates pruned tokens for dense tasks, enabling efficient inference with minimal performance loss. Across image classification, semantic segmentation, object detection, and instance segmentation, Cropr achieves 1.5–4× speedups with small drops in performance, including a 2× speedup on ADE20k with a drop of only 0.1 median mIoU across 5 seeds. The approach scales favorably with model size and input resolution, and LLF outperforms alternative token reactivation methods, making Cropr a practical way to accelerate ViTs across diverse vision tasks.

Abstract

The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner whose auxiliary prediction heads learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.

Paper Structure

This paper contains 42 sections, 6 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Cross-attention pruning (Cropr) modules successively prune less relevant tokens, retaining only the most discriminative ones for deeper layers. Our method accelerates ViTs while maintaining high performance and is applicable to many vision tasks, from classification to segmentation and detection. The example castle images illustrate the pruning process. The heatmap visualizes which tokens were pruned at each block $1$ to $L$ in the network.
  • Figure 2: Cropr module during training. The router scores tokens and separates the salient tokens to keep from the uninformative tokens to prune. The scorer's attention matrix, $\mathbf{A}$, is reused in the aggregator, whose output is used to make intermediate predictions. Gradient flow, indicated by a dotted red line, feeds back into the scorer and queries.
  • Figure 3: Cropr module during inference. (a) The aggregation function and the auxiliary head are removed. All queries are aggregated into a single query. (b) These optimizations speed up Cropr, with throughput comparable to that of a random selector. Results are shown for semantic segmentation.
  • Figure 4: Performance-throughput tradeoff plot for different model sizes on ImageNet-1k. Token pruning in larger models provides more speedup and less performance drop.
  • Figure 5: Semantic segmentation results on ADE20k. Cropr performs comparably to the unpruned baseline while achieving a $2\times$ speedup, marked by the dashed vertical line. 5 seeds per method.
  • ...and 6 more figures
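
The module described in the Figure 2 and Figure 3 captions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the captions do not confirm: the per-token score is taken to be the sum over queries of the pre-softmax cross-attention logits $\mathbf{A}$, key and value projections are omitted, and the names `cropr_train_step`, `cropr_infer_step`, and `top_k_split` are hypothetical. Under that scoring assumption, merging all queries into a single query before inference gives the same scores, which is one way the simplification in Figure 3(a) can be exact.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_split(scores, num_keep):
    # Indices of the num_keep highest-scoring tokens and of the rest,
    # each returned in original token order.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:num_keep]), np.sort(order[num_keep:])

def cropr_train_step(tokens, queries, num_keep):
    # tokens: (N, D) patch tokens; queries: (M, D) learnable queries.
    d = tokens.shape[-1]
    # Scorer: cross-attention logits A between queries and tokens, shape (M, N).
    A = queries @ tokens.T / np.sqrt(d)
    # Assumption: a token's relevance is the sum of its logits over all queries.
    scores = A.sum(axis=0)
    keep_idx, prune_idx = top_k_split(scores, num_keep)
    # Aggregator: reuses A to pool tokens; an auxiliary head (not shown)
    # would consume this output during training only.
    aggregated = softmax(A, axis=-1) @ tokens  # (M, D)
    return tokens[keep_idx], tokens[prune_idx], aggregated, keep_idx

def cropr_infer_step(tokens, merged_query, num_keep):
    # Inference: no aggregator, no auxiliary head, one pre-merged query (D,).
    d = tokens.shape[-1]
    scores = tokens @ merged_query / np.sqrt(d)
    keep_idx, _ = top_k_split(scores, num_keep)
    return tokens[keep_idx], keep_idx

# Toy usage with random data (shapes are illustrative only).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
queries = rng.normal(size=(4, 8))
keep, pruned, aggregated, keep_idx = cropr_train_step(tokens, queries, num_keep=6)
keep_inf, keep_idx_inf = cropr_infer_step(tokens, queries.sum(axis=0), num_keep=6)
```

Because the assumed score is linear in the queries, `queries.sum(axis=0)` can be computed once after training, so the inference path costs a single matrix-vector product plus a top-k selection per module.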