Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

TL;DR

The paper addresses the inefficiency of Vision Transformers, whose self-attention cost grows quadratically with the number of tokens, by introducing Multi-criteria Token Fusion (MCTF). MCTF gradually fuses tokens according to three criteria (similarity, informativeness, and the size of fused tokens) using bidirectional bipartite soft matching. It estimates informativeness with one-step-ahead attention, that is, the attention map of the next layer, and enforces token reduction consistency during finetuning to improve generalization. On ImageNet1K, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving accuracy over the base models by +0.5% and +0.3%, respectively, and other ViTs such as T2T-ViT and LV-ViT obtain at least a 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.

Abstract

Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of the tokens. By training the model equipped with MCTF using token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet1K). Experimental results show that MCTF consistently surpasses previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base model (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
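
To make the idea of combining several criteria concrete, the following is a minimal sketch, not the authors' implementation. The abstract names the three criteria but not their formulas, so the function `attraction`, the exponents `tau`, and the specific forms of `w_sim`, `w_info`, and `w_size` are assumptions chosen only for illustration. Each criterion is mapped to a positive weight and the weights are multiplied, so a pair of tokens is a good fusion candidate only if it is similar, uninformative, and small.

```python
import math

def cosine(u, v):
    """Cosine similarity between two token embeddings given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attraction(x_i, x_j, info_i, info_j, size_i, size_j, tau=(1.0, 1.0, 1.0)):
    """Hypothetical multi-criteria attraction between two tokens (higher = fuse first).

    ASSUMPTION: the three formulas below are illustrative stand-ins, not the
    paper's definitions. info_* are informativeness scores in [0, 1) and
    size_* count how many original tokens each token already represents.
    """
    w_sim = 0.5 * (cosine(x_i, x_j) + 1.0)      # similar tokens attract
    w_info = (1.0 - info_i) * (1.0 - info_j)    # informative tokens resist fusion
    w_size = 1.0 / (size_i * size_j)            # already-large tokens resist fusion
    return (w_sim ** tau[0]) * (w_info ** tau[1]) * (w_size ** tau[2])

background = [1.0, 0.0]
two_background = attraction(background, background, 0.1, 0.1, 1, 1)
with_foreground = attraction(background, background, 0.1, 0.8, 1, 1)
with_large_token = attraction(background, background, 0.1, 0.1, 1, 4)
```

In this toy setting the two uninformative, unit-size tokens receive the highest attraction; raising one token's informativeness or size lowers it, which is the qualitative behavior the abstract attributes to multi-criteria fusion.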

Paper Structure (23 sections, 11 equations, 10 figures, 9 tables)

This paper contains 23 sections, 11 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Comparison of token reduction methods with DeiT-T (left) and DeiT-S (right). Given a base model marked as a blue circle, previous token reduction methods gain speed at the cost of accuracy. Our MCTF, marked as a star, even improves performance while reducing the complexity of DeiT. Note that the model is finetuned only once, with the specific reduced number of tokens marked as a red star, and is then simply evaluated at diverse FLOPs by adjusting the number of reduced tokens.
  • Figure 2: Visualization of the fused tokens. Given (a) the input image (leftmost), (b) fusing the tokens with a single criterion $\textbf{W}^\text{sim}$ often results in excessive fusion of the foreground object. (c) When both similarity and informativeness are considered ($\textbf{W}^\text{sim}\&\textbf{W}^\text{info}$), tokens in the foreground objects are less fused while tokens in the background are largely fused. (d) Finally, MCTF with all three criteria ($\textbf{W}^\text{sim}\&\textbf{W}^\text{info}\&\textbf{W}^\text{size}$) helps retain the information of each component in the image by preventing large-size tokens.
  • Figure 3: Bidirectional bipartite soft matching. The set of tokens $\mathbf{X}$ is split into two groups $\mathbf{X}^\alpha, \mathbf{X}^\beta$, and bidirectional bipartite soft matching is conducted through Steps 1-4. The intensity of the lines indicates the multi-criteria weights $\mathbf{W}^t$.
  • Figure 4: Visualization of attentiveness in consecutive layers.
  • Figure 5: Illustration of the attention maps in consecutive layers and the approximated attention. (Left) The attention score $\mathbf{A}^l$ reflects the past influence of the tokens in generating $\mathbf{X}^l$. If we fuse the tokens $\mathbf{X}^l$ based on $\mathbf{A}^l$, $\mathbf{x}_1$ is prone to being fused despite having the highest informativeness score in the following attention. We therefore measure informativeness with the one-step-ahead attention $\mathbf{A}^{l+1}$. (Right) After the fusion, we also aggregate $\mathbf{A}^{l+1}$ to approximate the attention map $\hat{\mathbf{A}}^{l+1}$ used to update the fused tokens $\hat{\mathbf{X}}^l$.
  • ...and 5 more figures
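
The caption of Figure 3 describes the matching only at a high level, so the sketch below is a simplified illustration rather than the paper's procedure. It performs one direction of bipartite soft matching (group alpha proposing to group beta) followed by a size-weighted average. The second direction and the exact Steps 1-4 are not reproduced, and the helper names `match_alpha_to_beta` and `fuse`, the list-based token representation, and the size-weighted averaging rule are all assumptions made for this example.

```python
def match_alpha_to_beta(weights, r):
    """Pick up to r (alpha_index, beta_index) pairs to fuse.

    weights[i][j] is the attraction between token i in group alpha and
    token j in group beta. Each alpha token proposes to its most attractive
    beta token, and only the r strongest proposals are kept.
    """
    proposals = []
    for i, row in enumerate(weights):
        j = max(range(len(row)), key=row.__getitem__)
        proposals.append((row[j], i, j))
    proposals.sort(reverse=True)
    return [(i, j) for _, i, j in proposals[:r]]

def fuse(tokens_a, sizes_a, tokens_b, sizes_b, pairs):
    """Fuse matched alpha tokens into their beta partners (size-weighted mean).

    ASSUMPTION: size-weighted averaging is used here for illustration only.
    Returns the reduced token list and the updated token sizes.
    """
    sums = [[s * v for v in tok] for tok, s in zip(tokens_b, sizes_b)]
    sizes = list(sizes_b)
    matched = set()
    for i, j in pairs:
        matched.add(i)
        for d, v in enumerate(tokens_a[i]):
            sums[j][d] += sizes_a[i] * v
        sizes[j] += sizes_a[i]
    fused_b = [[v / s for v in tok] for tok, s in zip(sums, sizes)]
    kept_tokens = [t for i, t in enumerate(tokens_a) if i not in matched]
    kept_sizes = [s for i, s in enumerate(sizes_a) if i not in matched]
    return kept_tokens + fused_b, kept_sizes + sizes

# Toy example: 3 alpha tokens, 2 beta tokens, fuse r = 2 pairs.
weights = [[0.9, 0.1], [0.2, 0.3], [0.8, 0.85]]
tokens_a, sizes_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], [1, 1, 2]
tokens_b, sizes_b = [[3.0, 0.0], [0.0, 4.0]], [1, 2]

pairs = match_alpha_to_beta(weights, r=2)
tokens, sizes = fuse(tokens_a, sizes_a, tokens_b, sizes_b, pairs)
```

Here five tokens are reduced to three. Because the sizes are carried along, a later layer can discount tokens that already represent many patches, which is where a size criterion such as $\mathbf{W}^\text{size}$ would enter the weights.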