Vision Transformer Pruning

Mingjian Zhu, Yehui Tang, Kai Han

TL;DR

Vision transformers have high storage and compute demands, which hinders deployment on mobile devices. The paper introduces Vision Transformer Pruning (VTP), which learns per-dimension importance scores under $\ell_1$ regularization, prunes low-scoring dimensions in the multi-head self-attention (MHSA) and MLP blocks, and then fine-tunes the pruned model. On ImageNet-100 and ImageNet-1K with DeiT-base as the baseline, VTP achieves substantial parameter and FLOPs reductions with minimal accuracy loss (e.g., up to ~40% pruning with 0.5–1.1% Top-1 drops). This work provides a practical baseline for compressing vision transformers and points to future extensions such as pruning heads or layers.

Abstract

Vision transformers have achieved competitive performance on a variety of computer vision applications. However, their storage, run-time memory, and computational demands hinder deployment on mobile devices. Here we present a vision transformer pruning approach that identifies the impact of the dimensions in each transformer layer and then prunes accordingly. By encouraging dimension-wise sparsity in the transformer, important dimensions emerge automatically. A large number of dimensions with small importance scores can be discarded to achieve a high pruning ratio without significantly compromising accuracy. The pipeline for vision transformer pruning is as follows: 1) training with sparsity regularization; 2) pruning dimensions of linear projections; 3) fine-tuning. The reductions in parameters and FLOPs achieved by the proposed algorithm are evaluated and analyzed on the ImageNet dataset to demonstrate the effectiveness of the method.
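The first two pipeline steps can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the choice to derive the cut-off from a target pruning ratio, and the list-of-lists weight layout are all assumptions made for illustration. It shows an $\ell_1$ penalty on the importance scores and the removal of low-scoring dimensions from a pair of adjacent linear projections.

```python
# Hypothetical sketch of score-based dimension pruning (names are illustrative).

def l1_penalty(scores, lam):
    """Sparsity term lam * ||a||_1 that would be added to the training loss."""
    return lam * sum(abs(a) for a in scores)

def keep_indices(scores, prune_ratio):
    """Indices kept after dropping the lowest-|score| fraction of dimensions.

    Assumption: the threshold is set implicitly by a target pruning ratio.
    """
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    pruned = set(order[:n_prune])
    return [i for i in range(len(scores)) if i not in pruned]

def prune_columns(weight, keep):
    """Keep selected output dimensions of a (d_in x d_hidden) matrix."""
    return [[row[j] for j in keep] for row in weight]

def prune_rows(weight, keep):
    """Keep selected input dimensions of a (d_hidden x d_out) matrix."""
    return [weight[i] for i in keep]

# Toy example: 5 hidden dimensions with learned importance scores.
scores = [0.9, 0.01, -0.5, 0.002, 0.3]
w_in = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]          # 2 x 5 projection
w_out = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]    # 5 x 2 projection

keep = keep_indices(scores, prune_ratio=0.4)
w_in_pruned = prune_columns(w_in, keep)    # now 2 x 3
w_out_pruned = prune_rows(w_out, keep)     # now 3 x 2
```

Pruning the same indices from the output side of one projection and the input side of the next keeps the two matrices shape-compatible. The third step, fine-tuning, would then train the smaller model as usual.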

Paper Structure

This paper contains 16 sections, 8 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Vision Transformer Pruning.