
Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Tang

TL;DR

VoMix addresses the computational bottleneck of Vision Transformers with a plug-and-play, parameter-free token reduction method that operates without training. It identifies highly homogeneous tokens via a layer-wise similarity voting mechanism and softly mixes them into the retained tokens, reducing the token count by a proportion $r$ per layer while preserving information. The method offers a constrained complexity of $O(N^2 D(1/H + r(1-r)))$, and empirical results on ImageNet-1K and Kinetics-400 show throughput improvements of about $2\times$ to $2.4\times$ with only about $0.2\%$ to $0.6\%$ accuracy loss, across image and video modalities. VoMix outperforms existing token-reduction approaches and enables rapid deployment without retraining, with potential additional gains if used as a trainable extension.
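To make the per-layer reduction concrete, the following is a small sketch of how the token count shrinks when a proportion $r$ is removed in every block. It is illustrative only: the function name `token_schedule`, the floor rounding, and the uniform treatment of all tokens (including any class token) are assumptions, not details taken from the paper. The example uses 197 tokens, the count for a ViT with 16x16 patches on a 224x224 image plus one class token.

```python
import math


def token_schedule(num_tokens: int, r: float, num_layers: int) -> list[int]:
    """Return the token count before each block and after the last one.

    Assumption: every block removes floor(n * r) of its n input tokens.
    """
    counts = [num_tokens]
    for _ in range(num_layers):
        n = counts[-1]
        removed = math.floor(n * r)  # assumed rounding rule
        counts.append(n - removed)
    return counts


if __name__ == "__main__":
    # r = 10% in each of 12 blocks, starting from 197 tokens.
    schedule = token_schedule(197, 0.10, 12)
    print(schedule)
    print(f"fraction of tokens left: {schedule[-1] / schedule[0]:.2f}")
```

Because the reduction compounds, even a modest per-layer $r$ leaves the later blocks with far fewer tokens than the first one, which is where most of the throughput gain would come from.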

Abstract

Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a $2\times$ increase in throughput of existing ViT-H on ImageNet-1K and a $2.4\times$ increase in throughput of existing ViT-L on Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.

Paper Structure (14 sections, 5 equations, 8 figures, 8 tables, 1 algorithm)


Figures (8)

  • Figure 1: VoMix improves the speed-accuracy tradeoff of ViTs on Kinetics-400.
  • Figure 2: The overview of VoMix. VoMix is a plug-and-play module that can be easily applied to off-the-shelf ViT models. In each transformer block, VoMix reduces a proportion of $r$ tokens in the modified attention mechanism. VoMix has three stages: (1) Vote. VoMix selects $N \cdot r$ tokens out of $N$ tokens by voting, based on the similarity between keys. (2) Mix. VoMix mixes the queries of the selected tokens into those of the retained tokens. (3) Attention. VoMix conducts attention using mixed queries and vanilla keys.
  • Figure 3: The speed-accuracy tradeoff on MAE models. We use the same pruning ratio settings for each method on the same tier of ViTs for fairness. The pruning values are $r=(3\%)^{12}, (5\%)^{12}, (7\%)^{12}, (10\%)^{12}, (12\%)^{12}$.
  • Figure 4: Visualization of feature sources. The thin red boxes denote the tokens finally retained by VoMix. Blocks of the same color in the mixed image denote regions that are primarily mixed into one token in the last layer. For each image, we select two representative tokens and visualize their feature sources.
  • Figure 5: Image visualization. The two rows display the original images and the mixed images. Each color block indicates a region that VoMix mixes into one token.
  • ...and 3 more figures
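
The three stages named in the Figure 2 caption (vote, mix, attention) can be sketched for a single attention head as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. The caption only says that voting uses similarity between keys, that queries of selected tokens are mixed into the retained ones, and that attention uses mixed queries with vanilla keys. The specific voting rule (each token votes for its most similar neighbour and the most-voted tokens are selected), the softmax mixing weights, the temperature `tau`, and the function name `vomix_attention_sketch` are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def vomix_attention_sketch(q, k, v, r, tau=1.0):
    """q, k, v: (N, D) arrays for one head; r: proportion of tokens to remove.

    Returns the attention output for the retained tokens, shape (N_keep, D),
    and the sorted indices of the retained tokens.
    """
    n, d = k.shape
    n_sel = int(n * r)  # assumed rounding: truncate N * r

    # Cosine similarity between keys.
    k_unit = k / np.linalg.norm(k, axis=1, keepdims=True)
    sim = k_unit @ k_unit.T

    # (1) Vote. Assumed rule: every token votes for its most similar
    # other token; the n_sel most-voted tokens are selected for removal.
    sim_no_self = sim.copy()
    np.fill_diagonal(sim_no_self, -np.inf)
    nearest = sim_no_self.argmax(axis=1)
    votes = np.bincount(nearest, minlength=n)
    order = np.argsort(-votes, kind="stable")
    selected = np.sort(order[:n_sel])
    retained = np.sort(order[n_sel:])

    # (2) Mix. Assumed rule: each retained query becomes a softmax-weighted
    # average of itself (self-similarity 1) and the selected queries.
    sim_rs = sim[np.ix_(retained, selected)]             # (N_keep, n_sel)
    self_sim = np.ones((len(retained), 1))
    weights = softmax(np.concatenate([self_sim, sim_rs], axis=1) / tau)
    q_mixed = weights[:, :1] * q[retained] + weights[:, 1:] @ q[selected]

    # (3) Attention. Mixed queries attend to all original keys and values.
    attn = softmax(q_mixed @ k.T / np.sqrt(d), axis=1)   # (N_keep, N)
    return attn @ v, retained


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    out, kept = vomix_attention_sketch(q, k, v, r=0.25)
    print(out.shape, kept)
```

In this sketch the keys and values keep all $N$ rows, so removed tokens can still be attended to in the current block; only the query side, and therefore the output, shrinks to the retained tokens. With `r=0` the function reduces to ordinary scaled dot-product attention.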