GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

Xuwei Xu; Sen Wang; Yudong Chen; Yanping Zheng; Zhewei Wei; Jiajun Liu

TL;DR

This work tackles the high computational cost of Vision Transformers by reframing token reduction as a token summarization task. It introduces Graph-based Token Propagation (GTP), which uses a token-level graph to propagate information from discarded tokens to their spatially and semantically connected neighbors, yielding a condensed representation while preserving information. GTP comprises a token selection strategy based on regeneration difficulty $\gamma_i$ and broadcasting ability $\psi_i$, a sparse spatial-semantic token graph, and a summarization step $\mathbf{X}^s = \mathbf{X}^k + \alpha \hat{\mathcal{A}}^{p}\mathbf{X}^{p}$, followed by attention sparsification via proportional attention and top-$\theta N^2$ pruning. Empirically, GTP delivers substantial speedups (up to ~26% GMAC reduction and ~25–28% real inference speed-up) on pretrained DeiT backbones with minimal (~0.3%) accuracy loss, outperforming state-of-the-art token pruning/merging methods without finetuning and extending to ViTs without CLS tokens and higher token counts.
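The summarization step $\mathbf{X}^s = \mathbf{X}^k + \alpha \hat{\mathcal{A}}^{p}\mathbf{X}^{p}$ can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `propagate_tokens`, the single `scores` vector (standing in for whatever combination of $\gamma_i$ and $\psi_i$ the method uses), the row normalisation applied to the kept-by-propagated sub-adjacency, and the default value of `alpha` are all hypothetical choices made here.

```python
import numpy as np


def propagate_tokens(x, adj, scores, num_propagated, alpha=0.2):
    """Sketch of X^s = X^k + alpha * A_hat^p @ X^p.

    x:      (N, D) token features
    adj:    (N, N) token graph; adj[i, j] != 0 means token j can send to token i
    scores: (N,) importance per token (assumed: higher means keep)
    """
    order = np.argsort(scores)                     # ascending importance
    prop_idx = np.sort(order[:num_propagated])     # least important tokens
    keep_idx = np.sort(order[num_propagated:])     # tokens that survive

    x_k = x[keep_idx]                              # X^k, shape (N - P, D)
    x_p = x[prop_idx]                              # X^p, shape (P, D)

    # Sub-graph with edges from propagated tokens to kept tokens only.
    a_p = adj[np.ix_(keep_idx, prop_idx)].astype(np.float64)

    # Assumption: row-normalise so each kept token receives a weighted mean.
    row_sum = a_p.sum(axis=1, keepdims=True)
    a_p_hat = np.divide(a_p, row_sum, out=np.zeros_like(a_p), where=row_sum > 0)

    x_s = x_k + alpha * (a_p_hat @ x_p)            # summarised tokens X^s
    return x_s, keep_idx
```

Kept tokens with no edge to any propagated token pass through unchanged, so the output has $N - P$ rows and the discarded tokens contribute only through their graph neighbours.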

Abstract

Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging due to high computational demands. To expedite pre-trained ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in the computation. However, these methods still have some limitations, such as image information loss from pruned tokens and inefficiency in the token-matching process. In this paper, we introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs. Inspired by graph summarization algorithms, GTP meticulously propagates less significant tokens' information to spatially and semantically connected tokens that are of greater importance. Consequently, the remaining few tokens serve as a summarization of the entire token graph, allowing the method to reduce computational complexity while preserving essential information of eliminated tokens. Combined with an innovative token selection strategy, GTP can efficiently identify image tokens to be propagated. Extensive experiments have validated GTP's effectiveness, demonstrating both efficiency and performance improvements. Specifically, GTP decreases the computational complexity of both DeiT-S and DeiT-B by up to 26% with only a minimal 0.3% accuracy drop on ImageNet-1K without finetuning, and remarkably surpasses the state-of-the-art token merging method on various backbones at an even faster inference speed. The source code is available at https://github.com/Ackesnal/GTP-ViT.
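To make "spatially and semantically connected" concrete, the following NumPy sketch shows one way a sparse token graph of this kind could be built. The specifics are assumptions for illustration only and are not taken from the text above: the function name `build_token_graph`, the 8-neighbourhood on the patch grid as the spatial relation, top-`top_k` cosine similarity as the semantic relation, and the union of the two as the final graph.

```python
import numpy as np


def build_token_graph(x, grid_h, grid_w, top_k=8):
    """Return a boolean (N, N) adjacency combining spatial and semantic edges.

    x: (N, D) image-token features laid out row-major on a grid_h x grid_w grid.
    """
    n = grid_h * grid_w
    assert x.shape[0] == n, "expected one token per grid cell"

    # Spatial edges (assumed): tokens whose grid cells touch, diagonals included.
    rows, cols = np.divmod(np.arange(n), grid_w)
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    spatial = np.maximum(d_row, d_col) == 1

    # Semantic edges (assumed): each token links to its top_k most similar tokens.
    norms = np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    x_unit = x / norms
    sim = x_unit @ x_unit.T
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity

    semantic = np.zeros((n, n), dtype=bool)
    k = min(top_k, n - 1)
    if k > 0:
        nearest = np.argpartition(-sim, k - 1, axis=1)[:, :k]
        semantic[np.arange(n)[:, None], nearest] = True

    return spatial | semantic
```

The semantic part is directed in this sketch (token $i$ selecting token $j$ does not imply the reverse); whether to symmetrise it is a design choice left open here. Keeping `top_k` small is what keeps the graph sparse.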

Paper Structure (43 sections, 15 equations, 8 figures, 10 tables)

Introduction
Related works
Efficient Vision Transformers.
Token pruning and merging.
Methods
Preliminaries
Vision Transformer.
Graph Neural Network.
Efficient token propagation
Token selection
Regeneration difficulty.
Broadcasting ability.
Token selection.
Analysis.
Sparse graph construction
...and 28 more sections

Figures (8)

Figure 2: Comparisons among existing token pruning [rao2021dynamicvit, liang2021evit, kong2022spvit] (top), token merging [bolya2022token] (middle) and our token summarization (bottom) methods. Both token pruning and token summarization can efficiently measure the importance of each token and determine which tokens should be discarded, providing a computational advantage over token merging. However, only token merging and token summarization successfully preserve the information of eliminated tokens.
Figure 3: Graph-based Token Propagation (GTP) visualization. GTP constructs a graph of image tokens after the token embedding layer only once. Within each transformer block, GTP utilizes the attention map computed in the MHSA layer to estimate the importance score for each image token. Next, it propagates less significant tokens to important tokens w.r.t. a subgraph that only contains edges from propagated tokens to kept tokens. As a result, the remaining tokens form a condensed graph representation of the entire image.
Figure 4: Visualization of token summarization results. We employ GTP on DeiT-B [touvron2021training] and set the number of propagated tokens $P$ to 8. Unlike existing token pruning models that focus primarily on eliminating less significant background tokens, GTP ensures the retention of certain background tokens, thereby providing a summarized representation of the original image.
Figure 5: Comparisons of different token selection strategies. We apply different token selection strategies with GTP and report the top-1 accuracy for various numbers of propagated tokens ($P$).
Figure 6: Average cosine similarity. We calculate the average cosine similarity among image tokens in each layer for various token reduction methods finetuned on DeiT-S.
...and 3 more figures
