Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Tang
TL;DR
VoMix addresses the computational bottleneck of Vision Transformers with a plug-and-play, parameter-free token reduction method that requires no training. It identifies highly homogeneous tokens via a layer-wise similarity voting mechanism and softly mixes them into the retained tokens, reducing the token count by a ratio $r$ per layer while preserving information. The method has a bounded complexity of $O(N^2D(1/H + r(1-r)))$, and empirical results on ImageNet-1K and Kinetics-400 show throughput improvements of around $2\times$ to $2.4\times$ with only about 0.2% to 0.6% accuracy loss, across image and video modalities. VoMix outperforms existing token-reduction approaches and enables rapid deployment without retraining, with potential additional gains if used as a trainable extension.
Abstract
Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate that VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2$\times$ increase in throughput of an existing ViT-H on ImageNet-1K and a 2.4$\times$ increase in throughput of an existing ViT-L on the Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.
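
The vote-then-mix idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `vote_and_mix_sketch`, the `temperature` parameter, the use of cosine similarity on raw token features, the one-vote-per-token rule, and the softmax mixing weights are all assumptions made for illustration. It only shows one plausible way to pick highly homogeneous tokens by similarity voting and fold them softly into the retained set for a single layer.

```python
import numpy as np

def vote_and_mix_sketch(tokens, r, temperature=1.0):
    """Hypothetical single-layer sketch: drop a ratio r of tokens by voting,
    then softly mix the dropped tokens into the ones that remain.

    tokens: array of shape (N, D); r: ratio of tokens to remove, 0 <= r < 1.
    Returns an array of shape (N - int(N * r), D).
    """
    if not 0.0 <= r < 1.0:
        raise ValueError("r must satisfy 0 <= r < 1")
    n, _ = tokens.shape
    k = int(n * r)  # number of tokens selected for removal
    if k == 0:
        return tokens.copy()

    # Cosine similarity between all token pairs (assumed similarity measure).
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token may not vote for itself

    # Vote: each token casts one similarity-weighted vote for its nearest neighbour.
    target = sim.argmax(axis=1)
    score = np.zeros(n)
    np.add.at(score, target, sim[np.arange(n), target])

    # The k highest-voted tokens are treated as the most homogeneous ones.
    order = np.argsort(-score, kind="stable")
    selected = np.sort(order[:k])
    retained = np.sort(order[k:])

    # Mix: each retained token becomes a softmax-weighted average of itself
    # and the selected tokens (assumed mixing rule).
    cross = unit[retained] @ unit[selected].T / temperature
    self_logit = np.full((len(retained), 1), 1.0 / temperature)
    logits = np.concatenate([self_logit, cross], axis=1)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights[:, :1] * tokens[retained] + weights[:, 1:] @ tokens[selected]

# Example: 16 random tokens, remove a quarter of them in one layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
reduced = vote_and_mix_sketch(x, r=0.25)
print(reduced.shape)
```

Because every output token is a convex combination of input tokens, information from the removed tokens is blended into the survivors rather than discarded, which is the property the abstract attributes to the mixing step.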
