Token Cropr: Faster ViTs for Quite a Few Tasks
Benjamin Bergner, Christoph Lippert, Aravindh Mahendran
TL;DR
This paper introduces Cropr, a token pruning framework for Vision Transformers (ViTs) that learns per-token task relevance using cross-attention-based routing and auxiliary heads. The auxiliary heads are discarded after training, and Last Layer Fusion (LLF) reactivates pruned tokens for dense tasks, enabling efficient inference with minimal performance loss. Across image classification, semantic segmentation, object detection, and instance segmentation, Cropr achieves 1.5–4× speedups while maintaining competitive accuracy, including a 2× speedup on ADE20k with only a 0.1 drop in median mIoU across 5 seeds. The approach scales favorably with model size and input resolution, and LLF outperforms alternative token reactivation methods, making Cropr a practical way to accelerate ViTs across diverse vision tasks.
Abstract
The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
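
To make the idea of score-based token pruning concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single query vector scores tokens via softmax cross-attention, the top-scoring tokens are kept, and the pruned tokens are later placed back at their original positions as a simplified stand-in for reactivation. The function names (`score_tokens`, `prune_tokens`, `restore_tokens`), the single-query scorer, and the scatter-based restore are all illustrative assumptions; the auxiliary heads and the actual fusion layer are omitted.

```python
import numpy as np

def score_tokens(tokens, query):
    """Softmax cross-attention weights of one query over all tokens.

    tokens: (n, d) array, query: (d,) array. Returns an (n,) array.
    Assumption: a single query stands in for the learned router.
    """
    logits = tokens @ query / np.sqrt(tokens.shape[1])
    logits = logits - logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def prune_tokens(scores, keep):
    """Return sorted indices of the `keep` highest-scoring tokens and of the rest."""
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:keep]), np.sort(order[keep:])

def restore_tokens(n, kept, kept_idx, pruned, pruned_idx):
    """Scatter kept and pruned tokens back to their original positions.

    Assumption: a simplified stand-in for reactivating pruned tokens.
    """
    out = np.empty((n, kept.shape[1]), dtype=kept.dtype)
    out[kept_idx] = kept
    out[pruned_idx] = pruned
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))   # 8 tokens with 4 features each
query = rng.normal(size=4)         # hypothetical routing query

scores = score_tokens(tokens, query)
kept_idx, pruned_idx = prune_tokens(scores, keep=5)

# Stand-in for the remaining transformer blocks, applied to kept tokens only.
kept = tokens[kept_idx] * 2.0
restored = restore_tokens(len(tokens), kept, kept_idx, tokens[pruned_idx], pruned_idx)
```

In this sketch only the kept tokens pass through the later computation, which is where the throughput gain would come from; the restore step returns a full-length token sequence so that a dense prediction head could still produce an output for every position.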
