Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models

Seungcheol Park, Hojun Choi, U Kang

TL;DR

K-prune (Knowledge-preserving pruning) is an accurate retraining-free structured pruning algorithm for pretrained encoder-based language models. It minimizes pruning errors by preserving the useful knowledge of the pretrained model through an iterative pruning process with three steps: knowledge measurement, knowledge-preserving mask search, and knowledge-preserving weight-tuning.

Abstract

Given a pretrained encoder-based language model, how can we accurately compress it without retraining? Retraining-free structured pruning algorithms are crucial in pretrained language model compression because they greatly reduce pruning cost and can prune large language models. However, existing retraining-free algorithms suffer severe accuracy degradation, especially at high compression rates, because they fail to handle pruning errors. In this paper, we propose K-prune (Knowledge-preserving pruning), an accurate retraining-free structured pruning algorithm for pretrained encoder-based language models. K-prune focuses on preserving the useful knowledge of the pretrained model to minimize pruning errors through a carefully designed iterative pruning process composed of knowledge measurement, knowledge-preserving mask search, and knowledge-preserving weight-tuning. As a result, K-prune achieves up to 58.02%p higher F1 score than existing retraining-free pruning algorithms at a high compression rate of 80% on the SQuAD benchmark, without any retraining process.
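
The three steps named in the abstract can be illustrated on a toy example. The sketch below is not the authors' implementation: it assumes a single layer with scalar outputs, scores each neuron by the squared output change caused by removing it alone, and uses one least-squares scaling factor as a simplified stand-in for weight-tuning. All function names (`forward`, `measure`, `search_mask`, `tune`) and the data are hypothetical.

```python
# Toy illustration of measure -> mask search -> weight-tuning on one layer.
# Assumption: this is a simplified stand-in, not the K-prune algorithm itself.

def forward(acts, weights, mask):
    """Scalar output per sample: sum of the masked neuron contributions."""
    return [sum(m * a * w for m, a, w in zip(mask, row, weights)) for row in acts]

def measure(acts, weights):
    """Score neuron j by the squared output change caused by removing it alone."""
    return [sum((row[j] * weights[j]) ** 2 for row in acts)
            for j in range(len(weights))]

def search_mask(scores, keep):
    """Keep the `keep` highest-scoring neurons (1 = keep, 0 = prune)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = set(order[:keep])
    return [1 if j in kept else 0 for j in range(len(scores))]

def tune(acts, weights, mask, target):
    """Rescale the weights by the scalar that best fits the dense output."""
    pruned = forward(acts, weights, mask)
    denom = sum(p * p for p in pruned)
    alpha = sum(t * p for t, p in zip(target, pruned)) / denom if denom else 1.0
    return [alpha * w for w in weights]

def sq_error(a, b):
    """Sum of squared differences between two output vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical activations (3 samples x 4 neurons) and output weights.
acts = [[1.0, 0.5, 0.1, 2.0], [0.2, 1.5, 0.3, 1.0], [0.9, 0.1, 0.2, 0.5]]
weights = [0.8, -0.4, 0.05, 1.2]

dense = forward(acts, weights, [1, 1, 1, 1])      # unpruned reference output
scores = measure(acts, weights)                   # step 1: measurement
mask = search_mask(scores, keep=2)                # step 2: mask search
tuned = tune(acts, weights, mask, dense)          # step 3: weight-tuning

err_before = sq_error(dense, forward(acts, weights, mask))
err_after = sq_error(dense, forward(acts, tuned, mask))
print(mask, err_before, err_after)
```

Because a scaling factor of 1 is always a candidate, the least-squares rescaling cannot increase the reconstruction error, which is the intuition behind tuning the surviving weights after the mask is fixed.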

Paper Structure

This paper contains 35 sections, 15 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Accuracy vs. reduced FLOPs of retraining-free pruning algorithms using BERT and DistilBERT where the dotted line indicates the accuracy degradation of 3%p. K-prune (blue star) largely outperforms competitors in all settings.
  • Figure 2: Illustration of K-prune when the second sublayer is our target (best viewed in color). See Section \ref{subsec:overview} for details.
  • Figure 3: Accuracy of compressed models vs. time cost for pruning under a compression rate of 75%. K-prune (blue star) shows the best trade-off among both retraining-free and retraining-based pruning algorithms.
  • Figure 4: Change in F1 score with respect to the temperature $\gamma$ on $\text{SQuAD}_{1.1}$ under compression rates of 40%, 60%, and 80%. The F1 scores of the compressed model are only weakly sensitive to changes in $\gamma$.
  • Figure 5: Change in F1 score with respect to the balance coefficient $\lambda$ on $\text{SQuAD}_{1.1}$ under compression rates of 40%, 60%, and 80%. The leftmost and rightmost stars represent the cases that use only predictive or only representational knowledge, respectively. Representational knowledge is generally not effective by itself; however, it improves the accuracy of the compressed model when combined with predictive knowledge.
  • ...and 1 more figures
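
The captions of Figures 4 and 5 refer to a temperature $\gamma$ and a balance coefficient $\lambda$ that combines predictive and representational knowledge. As a rough illustration of how two such hyperparameters might interact, the sketch below assumes that predictive knowledge is a KL divergence between temperature-softened output distributions, that representational knowledge is a mean squared error between hidden vectors, and that the two are mixed as a convex combination. These forms, and every function name, are assumptions made for illustration; the paper's actual definitions may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_loss(dense_logits, pruned_logits, gamma):
    """Assumed form: KL divergence between softened output distributions."""
    p = softmax(dense_logits, gamma)
    q = softmax(pruned_logits, gamma)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def representational_loss(dense_hidden, pruned_hidden):
    """Assumed form: mean squared error between hidden representations."""
    diffs = [(d - s) ** 2 for d, s in zip(dense_hidden, pruned_hidden)]
    return sum(diffs) / len(diffs)

def knowledge_loss(dense_logits, pruned_logits, dense_hidden, pruned_hidden,
                   gamma, lam):
    """Assumed convex mix: lam=0 is predictive only, lam=1 representational only."""
    pred = predictive_loss(dense_logits, pruned_logits, gamma)
    rep = representational_loss(dense_hidden, pruned_hidden)
    return (1.0 - lam) * pred + lam * rep

# Hypothetical outputs of an unpruned and a pruned model on one input.
dense_logits, pruned_logits = [2.0, 0.5, -1.0], [1.6, 0.9, -0.8]
dense_hidden, pruned_hidden = [0.3, -0.2, 0.7], [0.25, -0.1, 0.6]

for lam in (0.0, 0.5, 1.0):
    loss = knowledge_loss(dense_logits, pruned_logits,
                          dense_hidden, pruned_hidden, gamma=2.0, lam=lam)
    print(lam, loss)
```

The convex combination is chosen here only so that the two endpoints correspond to the single-knowledge cases mentioned in the Figure 5 caption.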