Vision Transformer Pruning
Mingjian Zhu, Yehui Tang, Kai Han
TL;DR
Vision transformers have high storage and compute demands, which hinders their deployment on mobile devices. The paper introduces Vision Transformer Pruning (VTP), which learns per-dimension importance scores under $\ell_1$ regularization, prunes low-scoring dimensions in the multi-head self-attention (MHSA) and MLP blocks, and then fine-tunes the pruned model. On ImageNet-100 and ImageNet-1K, with DeiT-base as the baseline, VTP achieves substantial parameter and FLOPs reductions with minimal accuracy loss (e.g., up to ~40% pruning with 0.5–1.1% Top-1 drops). This work provides a practical baseline for compressing vision transformers and points to future extensions such as pruning heads or layers.
Abstract
Vision transformers have achieved competitive performance on a variety of computer vision applications. However, their storage, run-time memory, and computational demands hinder deployment on mobile devices. Here we present a vision transformer pruning approach that identifies the impact of the dimensions in each transformer layer and then prunes accordingly. By encouraging dimension-wise sparsity in the transformer, important dimensions automatically emerge. A large number of dimensions with small importance scores can be discarded to achieve a high pruning ratio without significantly compromising accuracy. The pipeline for vision transformer pruning is as follows: 1) training with sparsity regularization; 2) pruning dimensions of linear projections; 3) fine-tuning. The parameter and FLOPs reduction ratios of the proposed algorithm are evaluated and analyzed on the ImageNet dataset to demonstrate the effectiveness of the method.
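The first two pipeline steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`l1_penalty`, `keep_indices`, `prune_linear`), the hand-picked scores, and the rule of keeping the largest-magnitude scores for a given pruning ratio are all assumptions made for illustration. The sketch treats one linear projection whose outputs are scaled by learned importance scores, and shows that dropping low-score dimensions leaves the surviving outputs unchanged.

```python
import numpy as np

def l1_penalty(scores, lam):
    """Sparsity term lam * ||scores||_1, added to the task loss in step 1."""
    return lam * np.abs(scores).sum()

def keep_indices(scores, prune_ratio):
    """Indices of dimensions that survive pruning, in ascending order.

    Assumption: keep the dimensions with the largest |score|.
    """
    d = scores.shape[0]
    n_keep = d - int(round(prune_ratio * d))
    order = np.argsort(-np.abs(scores), kind="stable")
    return np.sort(order[:n_keep])

def prune_linear(W, b, keep):
    """Step 2: drop output dimensions of y = x @ W + b not listed in keep."""
    return W[:, keep], b[keep]

rng = np.random.default_rng(0)
d_in, d_out = 8, 10
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

# Hypothetical importance scores, as they might look after sparse training.
scores = np.array([0.9, 0.01, 0.7, 0.0, 0.5, 0.02, 0.8, 0.03, 0.6, 0.04])

keep = keep_indices(scores, prune_ratio=0.4)
W_p, b_p = prune_linear(W, b, keep)

x = rng.normal(size=(3, d_in))
gated = (x @ W + b) * scores             # soft gating during sparse training
pruned = (x @ W_p + b_p) * scores[keep]  # smaller projection after pruning
```

Step 3 (fine-tuning) would then continue training the smaller projection `W_p`, `b_p` on the task loss alone; it is omitted here because it depends on the full model and training setup.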
