Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Shuai Peng; Di Fu; Baole Wei; Yong Cao; Liangcai Gao; Zhi Tang

TL;DR

VoMix addresses the computational bottleneck of Vision Transformers with a plug-and-play, parameter-free token reduction method that requires no training. It identifies highly homogeneous tokens through a layer-wise similarity voting mechanism and softly mixes them into the retained tokens, reducing the token count by a proportion $r$ per layer while preserving visual information. The method has a constrained complexity of $O(N^2D(1/H + r(1-r)))$, and empirical results on ImageNet-1K and Kinetics-400 show throughput improvements of around $2\times$ to $2.4\times$ with only about 0.2-0.6% accuracy loss, across image and video modalities. VoMix outperforms existing token-reduction approaches and enables rapid deployment without retraining, with potential additional gains if used as a trainable extension.
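As a hedged aside on why the complexity can be called constrained: this summary does not say which cost the expression covers, and reading $N$, $D$, and $H$ as token count, embedding dimension, and number of heads is an assumption. Whatever the reading, the $r$-dependent term is bounded for any reduction ratio $0 \le r \le 1$, because $r(1-r)$ peaks at $r = 1/2$:

```latex
r(1-r) = \tfrac{1}{4} - \left(r - \tfrac{1}{2}\right)^{2} \le \tfrac{1}{4}
\quad\Longrightarrow\quad
N^{2} D \left(\tfrac{1}{H} + r(1-r)\right) \le N^{2} D \left(\tfrac{1}{H} + \tfrac{1}{4}\right)

% Illustrative values only (H = 16 is an assumed head count, not from the source):
% r = 0.1 gives r(1-r) = 0.09, so 1/H + r(1-r) = 0.0625 + 0.09 = 0.1525
```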

Abstract

Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate that VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a $2\times$ increase in throughput of an existing ViT-H on ImageNet-1K and a $2.4\times$ increase in throughput of an existing ViT-L on the Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.
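The vote, mix, and attention steps described above (and in the Figure 2 caption) can be sketched for a single attention head as follows. This is a minimal illustration, not the authors' implementation. The specific rules are assumptions, since this summary does not define them: cosine similarity between keys, one vote per token for its most similar other token, selection of the most-voted tokens, and softmax mixing weights. The function name `vote_and_mix_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vote_and_mix_attention(q, k, v, r):
    """Single-head sketch. q, k, v: (N, D) arrays; r: proportion of tokens to reduce."""
    n, d = q.shape
    n_sel = int(n * r)  # number of tokens to select for mixing (N * r, floored)

    # Cosine similarity between keys (assumed similarity measure).
    k_unit = k / np.linalg.norm(k, axis=1, keepdims=True)
    sim = k_unit @ k_unit.T

    # (1) Vote: each token votes for its most similar *other* token (assumed rule);
    # the n_sel most-voted tokens are treated as highly homogeneous and selected.
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)
    votes = np.bincount(masked.argmax(axis=1), minlength=n)
    selected = np.argsort(-votes, kind="stable")[:n_sel]
    retained = np.setdiff1d(np.arange(n), selected)

    # (2) Mix: each retained query becomes a softmax-weighted average of itself
    # (self-similarity 1) and the selected queries (assumed weighting).
    logits = np.concatenate(
        [np.ones((len(retained), 1)), sim[np.ix_(retained, selected)]], axis=1
    )
    w = softmax(logits, axis=1)
    q_mixed = w[:, :1] * q[retained] + w[:, 1:] @ q[selected]

    # (3) Attention: mixed queries attend over all N vanilla keys and values,
    # so the output has N - n_sel tokens.
    attn = softmax(q_mixed @ k.T / np.sqrt(d), axis=1)
    return attn @ v, retained
```

With `r = 0` no token is selected, the mixing weights collapse to 1, and the function reduces to ordinary scaled dot-product attention, which is a convenient sanity check for a sketch of this kind.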

Paper Structure (14 sections, 5 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Related Work
  Efficient Vision Transformers
  Token Reduction
Vote&Mix
  Token Vote
  Token Mix
  Complexity Analysis
Experiments
  Image Experiments
  Video Experiments
  Ablation Study
Discussion
Conclusion

Figures (8)

Figure 1: VoMix improves the speed-accuracy tradeoff of ViTs on Kinetics-400.
Figure 2: Overview of VoMix. VoMix is a plug-and-play module that can be easily applied to off-the-shelf ViT models. In each transformer block, VoMix reduces a proportion $r$ of the tokens inside a modified attention mechanism. VoMix has three stages: (1) Vote. VoMix selects $N \cdot r$ of the $N$ tokens by voting on the similarity between keys. (2) Mix. VoMix mixes the queries of the selected tokens into those of the retained tokens. (3) Attention. VoMix computes attention using the mixed queries and the vanilla keys.
Figure 3: The speed-accuracy tradeoff on MAE models. We use the same pruning ratio settings for each method on the same tier of ViTs for fairness. The pruning values are $r=(3\%)^{12}, (5\%)^{12}, (7\%)^{12}, (10\%)^{12}, (12\%)^{12}$.
Figure 4: Visualization of feature sources. The thin red boxes mark the tokens finally retained by VoMix. Blocks of the same color in the mixed image denote tokens that are primarily mixed into one token in the last layer. For each image, we select two representative tokens and visualize their feature sources.
Figure 5: Image Visualization. The two rows display the original images and the mixed images. The color blocks indicate that VoMix mixes the region into one token.
...and 3 more figures
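To give a feel for the pruning settings in the Figure 3 caption, the sketch below counts the tokens left after each layer. It rests on assumptions not confirmed by this summary: that $(10\%)^{12}$ means $r = 10\%$ applied in each of 12 layers, that the per-layer reduction is floored to an integer, and that the model starts from 197 tokens (196 patches plus a class token, as in a ViT-B/16 at 224x224 input). The helper name `tokens_per_layer` is hypothetical.

```python
def tokens_per_layer(n_tokens, ratios):
    """Return the token count before the first layer and after each layer.

    ratios: one reduction proportion r per layer (assumed to be applied
    to the current token count and floored to an integer).
    """
    counts = [n_tokens]
    for r in ratios:
        n_tokens -= int(n_tokens * r)  # remove floor(N * r) tokens in this layer
        counts.append(n_tokens)
    return counts

# Assumed schedule: r = 10% in each of 12 layers, starting from 197 tokens.
counts = tokens_per_layer(197, [0.10] * 12)
print(counts)
```

Because each layer removes a proportion of the tokens that remain, the absolute number removed shrinks with depth, so early layers contribute most of the reduction under a constant schedule.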
