Adaptive MLP Pruning for Large Vision Transformers

Chengchao Shen

TL;DR

This paper introduces a label-free information entropy criterion that fully models the predictions of the original model, giving a more accurate importance evaluation of MLP neurons, and proposes an Adaptive MLP Pruning (AMP) method to substantially reduce the parameters of large vision transformers without obvious performance degradation.

Abstract

Large vision transformers exhibit impressive scalability: their performance improves substantially as model capacity increases. Nevertheless, their large parameter counts result in exorbitant computational and memory demands. By analyzing prevalent transformer structures, we find that multilayer perceptron (MLP) modules constitute the largest share of the model's parameters. In this paper, we propose an Adaptive MLP Pruning (AMP) method to substantially reduce the parameters of large vision transformers without obvious performance degradation. First, we adopt a Taylor-based method to evaluate the importance of MLP neurons. However, computing importance with a one-hot cross entropy loss ignores the model's potential predictions on other categories, which degrades the quality of the resulting importance scores. To address this issue, we introduce a label-free information entropy criterion that fully models the predictions of the original model for more accurate importance evaluation. Second, we rank the hidden neurons of each MLP by these importance scores and apply a binary search algorithm to adaptively prune the ranked neurons according to the redundancy of each MLP module, thereby avoiding a predefined compression ratio. Experimental results on several state-of-the-art large vision transformers, including CLIP and DINOv2, demonstrate that our method achieves roughly 40% parameter and FLOPs reduction in a nearly lossless manner. Moreover, when the models are not finetuned after pruning, our method outperforms other pruning methods by a significantly large margin. The source code and trained weights are available at https://github.com/visresearch/AMP.
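As a minimal sketch of the label-free criterion described in the abstract, the following pure-Python example scores hidden neurons with a first-order Taylor estimate, |a_k * dH/da_k|, where H is the information entropy of the softmax predictions. The single linear output layer, the function names, and the toy weights are illustrative assumptions rather than the paper's implementation, and the exact formulation in the paper may differ.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p):
    """Information entropy H(p) = -sum_i p_i * log(p_i); needs no label."""
    return -sum(v * math.log(v) for v in p if v > 0.0)


def entropy_grad_logits(p):
    """Analytic gradient of H(softmax(z)) w.r.t. z: dH/dz_j = -p_j * (log p_j + H)."""
    h = entropy(p)
    return [-v * (math.log(v) + h) if v > 0.0 else 0.0 for v in p]


def taylor_importance(a, w2):
    """First-order Taylor score |a_k * dH/da_k| for each hidden neuron.

    a  : hidden activations, length h (assumed already computed)
    w2 : output weights as c rows of length h (hypothetical toy layer)
    """
    z = [sum(w * act for w, act in zip(row, a)) for row in w2]
    gz = entropy_grad_logits(softmax(z))
    # Chain rule through the linear output layer: dH/da_k = sum_j dH/dz_j * w2[j][k]
    ga = [sum(gz[j] * w2[j][k] for j in range(len(w2))) for k in range(len(a))]
    return [abs(a_k * g_k) for a_k, g_k in zip(a, ga)]


# Toy usage: three hidden neurons, two output classes.
activations = [1.0, 2.0, 0.0]
weights = [[0.5, 0.0, 1.0], [-0.5, 0.0, 2.0]]
print(taylor_importance(activations, weights))
```

In this toy setup, a neuron with zero activation or with no outgoing weight receives a score of zero, so it would be ranked last. Unlike a one-hot cross entropy loss, the entropy gradient depends on every predicted probability and not only on the labeled class.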

Paper Structure (24 sections, 8 equations, 4 figures, 8 tables, 1 algorithm)


Introduction
Related Work
Model Pruning
Token Reduction
The Proposed Method
Overview
Preliminaries of Neuron Importance Evaluation
Information Entropy for Neuron Importance Evaluation
Adaptive MLP Pruning
Knowledge Distillation
Experiments
Experimental Settings
Zero-Shot Image Classification
Zero-Shot Retrieval
Comparison to Other Pruning Methods
...and 9 more sections

Figures (4)

Figure 1: Overview of the proposed method. First, the importance scores of hidden neurons are evaluated by a Taylor-based method. Then, we rank the hidden neurons by the obtained importance scores. Afterwards, we conduct a binary search to adaptively prune the hidden neurons of the MLP modules in the transformer. Finally, the pruned model is guided by the original model through knowledge distillation to recover performance.
Figure 2: One-hot cross entropy vs information entropy for neuron importance evaluation. Our proposed information entropy exploits all predictions of the model for more accurate importance evaluation.
Figure 3: Adaptive MLP Pruning. In each pruning step, we use a binary search algorithm to adaptively halve the search range of the optimal hidden size according to the information entropy $\mathcal{E}$, until the maximum number of pruning steps is reached. If the increment of information entropy is within the range of $\Delta \mathcal{E}$, we further prune the hidden neurons of the MLP. Otherwise, we reduce the number of neurons pruned in the previous step.
Figure 4: The relation between MLP hidden size and information entropy.
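The binary search described in the Figure 3 caption can be sketched as follows. This is a hedged illustration under stated assumptions: `entropy_at` is a hypothetical callable that returns the information entropy of the model when only the top-ranked `k` hidden neurons of one MLP are kept, the entropy increment is assumed to shrink monotonically as `k` grows, and `toy_entropy` is synthetic data standing in for a real measurement.

```python
def adaptive_hidden_size(entropy_at, full_size, delta_e, max_steps=10):
    """Binary-search the smallest kept hidden size whose entropy increment
    over the unpruned MLP stays within delta_e.

    entropy_at : hypothetical callable, k kept neurons -> information entropy
    full_size  : hidden size of the unpruned MLP
    delta_e    : tolerated increment of information entropy
    max_steps  : maximum number of pruning steps
    """
    base = entropy_at(full_size)   # entropy of the unpruned module
    lo, hi = 0, full_size          # hi is always an acceptable size
    for _ in range(max_steps):
        if hi - lo <= 1:
            break
        mid = (lo + hi) // 2
        if entropy_at(mid) - base <= delta_e:
            hi = mid               # increment within tolerance: prune further
        else:
            lo = mid               # too much damage: prune fewer neurons
    return hi


def toy_entropy(k):
    """Synthetic curve: entropy is flat down to 60 neurons, then rises."""
    return 1.0 + 0.01 * max(0, 60 - k)


print(adaptive_hidden_size(toy_entropy, full_size=100, delta_e=0.055))
```

Because each MLP module is searched against its own entropy curve, modules with more redundancy end up with smaller hidden sizes, which is how a predefined compression ratio is avoided. With a small `max_steps`, the search stops early and returns the most recent size known to be acceptable.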
