Adaptive Pruning of Pretrained Transformer via Differential Inclusions

Yizhuo Ding; Ke Fan; Yikai Wang; Xinwei Sun; Yanwei Fu

TL;DR

The paper tackles the efficiency challenge of deploying large pretrained transformers by introducing Solution Path Pruning (SPP), a single-stage pruning framework that generates a complete regularization solution path for mask-based pruning, yielding a Transformer Weight Family with varying sparsity. It uses a differential inclusion to continuously morph the mask and an augmented loss whose solution path reveals sparsity-structured architectures without restarting the search. The authors prove global convergence under Kurdyka-Łojasiewicz conditions and demonstrate strong empirical performance across vision and language models, including notable pruning outcomes for CLIP and LLMs with minimal accuracy loss. This approach offers a practical and theoretically grounded route to flexible, hardware-friendly transformer compression with potential broad impact on deployment of large models.
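The idea of a regularization solution path can be illustrated with a toy sketch. It assumes a generic Split Linearized Bregman Iteration, a standard discretization of differential inclusions of this kind, applied to a hypothetical quadratic loss. The function names, the loss, and the hyperparameters `nu`, `kappa`, `alpha` are illustrative assumptions and are not taken from the paper.

```python
def soft_threshold(v, lam=1.0):
    """Proximal map of the L1 norm: shrink v toward zero by lam."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def split_lbi_path(target, nu=1.0, kappa=5.0, alpha=0.01, steps=2000):
    """Toy Split LBI on L(w) = 0.5 * ||w - target||^2 (an assumed loss).

    The augmented loss is L(w) + ||w - gamma||^2 / (2 * nu).
    Returns the support of gamma after every iteration.
    """
    w = list(target)              # start from the "pretrained" weights
    v = [0.0] * len(target)       # auxiliary variable driving gamma
    gamma = [0.0] * len(target)   # sparse variable, starts empty
    supports = []
    for _ in range(steps):
        grad_w = [(wi - ti) + (wi - gi) / nu
                  for wi, ti, gi in zip(w, target, gamma)]
        grad_g = [(gi - wi) / nu for wi, gi in zip(w, gamma)]
        w = [wi - kappa * alpha * g for wi, g in zip(w, grad_w)]
        v = [vi - alpha * g for vi, g in zip(v, grad_g)]
        gamma = [kappa * soft_threshold(vi) for vi in v]
        supports.append(frozenset(i for i, gi in enumerate(gamma) if gi != 0.0))
    return supports


def entry_order(supports):
    """Order in which coordinates first enter the support of gamma."""
    seen, order = set(), []
    for support in supports:
        for i in sorted(support - seen):
            order.append(i)
            seen.add(i)
    return order


supports = split_lbi_path([3.0, 1.0, 0.2, -2.0])
print(entry_order(supports))  # → [0, 3, 1, 2]
```

In this toy run, coordinates with larger magnitude enter the support earlier, so a single pass yields a nested sequence of supports. Stopping at different iterations gives structures of different sparsity without restarting the search.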

Abstract

Large transformers have demonstrated remarkable success, making it necessary to compress these models to reduce inference costs while preserving their performance. Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio, which results in high computational costs. In contrast, we propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure. Therefore, the solution path identifies a Transformer Weight Family with various sparsity levels, offering greater flexibility and customization. In this paper, we introduce such an effective pruning method, termed SPP (Solution Path Pruning). To achieve effective pruning, we segment the transformers into paired modules, including query-key pairs, value-projection pairs, and sequential linear layers, and apply low-rank compression to these pairs, maintaining the output structure while enabling structural compression within the inner states. Extensive experiments conducted on various well-known transformer backbones have demonstrated the efficacy of SPP.
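The claim that pruning a query-key pair preserves the output structure can be checked on a small example. The sketch below assumes a hand-picked binary mask shared across the inner dimension of the two projections. In SPP the structure comes from the support of the mask parameter, so the mask values, the helper names, and the integer weights here are purely hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention_scores(x, w_q, w_k):
    """Unnormalized scores (X W_q)(X W_k)^T, shape tokens x tokens."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))


def apply_mask(w, mask):
    """Zero the inner (column) dimensions where mask is 0."""
    return [[wij * m for wij, m in zip(row, mask)] for row in w]


def prune_columns(w, mask):
    """Physically drop the inner (column) dimensions where mask is 0."""
    return [[wij for wij, m in zip(row, mask) if m] for row in w]


x = [[1, 2, 0], [0, 1, 3]]                        # 2 tokens, model dim 3
w_q = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]  # 3 x 4 query projection
w_k = [[2, 1, 0, 1], [1, 0, 1, 0], [0, 2, 1, 1]]  # 3 x 4 key projection
mask = [1, 0, 1, 0]                               # keep inner dims 0 and 2

masked = attention_scores(x, apply_mask(w_q, mask), apply_mask(w_k, mask))
pruned = attention_scores(x, prune_columns(w_q, mask), prune_columns(w_k, mask))
print(masked == pruned)  # → True
```

Because both projections share the same inner dimension, removing the masked columns from each shrinks the weights from 3 x 4 to 3 x 2 while the score matrix keeps its tokens-by-tokens shape.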

Paper Structure

This paper contains 20 sections, 4 theorems, 59 equations, 4 figures, 9 tables, 2 algorithms.

Introduction
Related work
Method
Mask-based pruning
Differential inclusion for regularization weight family
Convergence
Experiments
Main results
Ablation studies
Further studies
Conclusion
Experiments details and visualization
Group Lasso
More Related Works
Proof of Theorem \ref{Thm:conv-SLBI}
...and 5 more sections

Key Result

Theorem 1

[Global Convergence of SPP] Suppose that the Assumption holds. Let $(W_{k},\Gamma_{k})$ be the sequence generated by our method (Eq. (Eq:weight_fa_alg)) with a finite initialization. If …, where $C=\max|W_0|$ is the maximum absolute weight value of the pretrained model, then $(M_{k},\Gamma_{k})$ converges to a critical point of $\bar{\mathcal{L}}$ defined in Eq. (Eq:weight_fa_alg), and $\{M^{k}\}$ converges to a cr…

Figures (4)

Figure 1: Comparison of SPP and the lasso method. (a) SPP can obtain sparse models at all sparsity levels after a single search stage, which includes an update stage and a prune stage. (b) The lasso method can obtain only one sparse model in a single search stage.
Figure 2: Visualization of the solution path of DeiT-Small. We show the changes in the L1-norm of the projected weight value $\Gamma$ during the search stage. The x-axis is the iteration number during training; the y-axis is the L1-norm of the $\Gamma$ parameters per layer.
Figure 3: Visualization of the proportion of parameters on DeiT-Small. The three colors indicate the three pairs of weights.
Figure 4: Visualization of components. The color depth shows the sparsity of the corresponding layer. The number shows the dimension of the linear matrix.

Theorems & Definitions (16)

Remark 1
Remark 2
Theorem 1
Theorem 2
Remark 3
Corollary 1
proof
Lemma 1
proof
proof
...and 6 more
