Reweighted Proximal Pruning for Large-Scale Language Representation

Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lin, Yanzhi Wang

TL;DR

The paper tackles the challenge of compressing large-scale language representations like BERT without compromising downstream transfer learning performance. It introduces Reweighted Proximal Pruning (RPP), a pruning framework that combines reweighted $\ell_1$ minimization with a proximal operator to learn a universal sparsity pattern during pre-training, which is then fixed during fine-tuning. Empirical results show RPP achieves substantial sparsity (up to ~60%) with minimal loss on pre-training and most GLUE tasks, and it markedly outperforms a baseline iterative pruning approach on SQuAD and other transfers. Visualizations reveal that RPP induces structured sparsity in transformer blocks and preserves the semantic geometry of word embeddings, supporting its practical use for deploying large language models on resource-constrained devices.
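
The combination of reweighted $\ell_1$ minimization and a proximal operator described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary does not state the update rule, so the sketch assumes the standard soft-thresholding proximal operator for a weighted $\ell_1$ penalty and the common reweighting rule $\alpha_i = 1/(|w_i| + \epsilon)$. The toy quadratic loss, the function names, and the values of `lr`, `lam`, and `eps` are all hypothetical choices for illustration.

```python
import numpy as np

def soft_threshold(w, thresh):
    """Proximal operator of a weighted l1 penalty: shrink entries toward zero."""
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

def reweighted_proximal_step(w, grad, alpha, lr=0.1, lam=0.05):
    """One proximal-gradient step on loss(w) + lam * sum(alpha * |w|)."""
    w_half = w - lr * grad  # gradient step on the smooth loss only
    return soft_threshold(w_half, lr * lam * alpha)

def reweight(w, eps=1e-3):
    """Assumed reweighting rule: small-magnitude weights get a larger penalty."""
    return 1.0 / (np.abs(w) + eps)

def masked_finetune_step(w, grad, mask, lr=0.1):
    """Fine-tuning step that keeps the sparsity pattern (mask) fixed."""
    return (w - lr * grad) * mask

# Toy stand-in for pre-training: quadratic loss 0.5 * ||w - target||^2.
target = np.array([2.0, 0.0, -1.5, 0.01, 0.0])
w = np.ones_like(target)
alpha = np.ones_like(w)
for it in range(200):
    grad = w - target
    w = reweighted_proximal_step(w, grad, alpha)
    if (it + 1) % 20 == 0:  # refresh the penalty weights periodically
        alpha = reweight(w)

mask = w != 0  # sparsity pattern learned in the "pre-training" phase
print(mask)
```

In this toy setting the weights whose targets are zero or near zero are driven exactly to zero, while the large-magnitude weights are only mildly shrunk. The resulting `mask` plays the role of the sparsity pattern that is fixed during fine-tuning, which `masked_finetune_step` preserves by zeroing pruned entries after each update.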

Abstract

Recently, pre-trained language representations such as BERT have flourished as the mainstay of the natural language understanding community. These pre-trained language representations achieve state-of-the-art results on a wide range of downstream tasks. Along with this continuous and significant performance improvement, the size and complexity of these pre-trained neural models continue to increase rapidly. Is it possible to compress these large-scale language representation models? How will the pruned language representation affect the downstream multi-task transfer learning objectives? In this paper, we propose Reweighted Proximal Pruning (RPP), a new pruning method specifically designed for large-scale language representation models. Through experiments on SQuAD and the GLUE benchmark suite, we show that proximal pruned BERT keeps high accuracy for both the pre-training task and the multiple downstream fine-tuning tasks at high prune ratios. RPP provides a new perspective to help us analyze what large-scale language representations might learn. Additionally, RPP makes it possible to deploy a large state-of-the-art language representation model such as BERT on a series of distinct devices (e.g., online servers, mobile phones, and edge devices).

Paper Structure

This paper contains 27 sections, 9 equations, 12 figures, 1 table, 3 algorithms.

Figures (12)

  • Figure 1: Overview of pruning BERT using Reweighted Proximal Pruning algorithm and then fine-tuning on a wide range of downstream transfer learning tasks. Through RPP, we find the identical universal sparsity $\mathcal{ S }_{\hat{\mathbf{w}}}$. The BERT model pruned with RPP could be fine-tuned over the downstream transfer learning tasks.
  • Figure 2: Evaluate the performance of pruned $\mathrm { BERT } _ { \mathrm { BASE } }$ using NIP and RPP, respectively (MLM and NSP accuracy on pre-training data and F1 score of fine-tuning on SQuAD 1.1 are reported).
  • Figure 3: Visualization of the sparse pattern $\mathcal{S}$ in the pruned $\mathrm { BERT } _ { \mathrm { BASE } }$ model $\mathbf{w}$. We sample 6 matrices (3 query matrices in the top row and 3 key matrices in the bottom row) from layer 2, layer 3 and layer 11 in the sparsest pruned $\mathrm { BERT } _ { \mathrm { BASE } }$.
  • Figure 4: $t$-SNE visualization of word embeddings in the original BERT model and the pruned BERT model using RPP. From left to right: $t$-SNE of the original BERT embedding, together with an enlarged region around the word "intelligent"; $t$-SNE of the embedding in pruned BERT, together with an enlarged region. These visualizations are obtained by running $t$-SNE for 1000 steps with perplexity=100.
  • Figure A1: Evaluate the performance of pruned $\mathrm { BERT } _ { \mathrm { BASE } }$ using NIP and RPP, respectively (MLM and NSP accuracy on pre-training data and F1 score of fine-tuning on QQP are reported).
  • ...and 7 more figures