Reweighted Proximal Pruning for Large-Scale Language Representation
Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lin, Yanzhi Wang
TL;DR
The paper tackles the challenge of compressing large-scale language representations like BERT without compromising downstream transfer learning performance. It introduces Reweighted Proximal Pruning (RPP), a pruning framework that combines reweighted $\ell_1$ minimization with a proximal operator to learn a universal sparsity pattern during pre-training; that pattern is then held fixed during fine-tuning. Empirical results show that RPP reaches substantial sparsity (up to roughly 60%) with minimal loss on pre-training and most GLUE tasks, and that it markedly outperforms a baseline iterative pruning approach on SQuAD and other transfer tasks. Visualizations reveal that RPP induces structured sparsity in transformer blocks and preserves the semantic geometry of word embeddings, supporting its practical use for deploying large language models on resource-constrained devices.
Abstract
Recently, pre-trained language representations such as BERT have become the mainstay of the natural language understanding community. These representations achieve state-of-the-art results on a wide range of downstream tasks. As performance keeps improving, the size and complexity of these pre-trained neural models continue to grow rapidly. Is it possible to compress these large-scale language representation models? How will a pruned language representation affect the downstream multi-task transfer learning objectives? In this paper, we propose Reweighted Proximal Pruning (RPP), a new pruning method specifically designed for large-scale language representation models. Through experiments on SQuAD and the GLUE benchmark suite, we show that proximal pruned BERT keeps high accuracy on both the pre-training task and multiple downstream fine-tuning tasks at high prune ratios. RPP provides a new perspective for analyzing what large-scale language representations might learn. Additionally, RPP makes it possible to deploy a large state-of-the-art language representation model such as BERT on a range of distinct devices (e.g., online servers, mobile phones, and edge devices).
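To make the idea of pairing reweighted $\ell_1$ minimization with a proximal operator concrete, the following is a minimal sketch on a toy weight vector, not the authors' implementation. It assumes the textbook forms of both ingredients: per-weight penalties of $1 / (|w_i| + \epsilon)$ and soft-thresholding as the proximal operator of the weighted $\ell_1$ norm. The function names, the toy values, and the hyperparameters `lr`, `lam`, and `eps` are all hypothetical choices made for illustration.

```python
def reweighted_penalties(weights, eps=1e-3):
    """Per-weight l1 penalties: small weights get large penalties (assumed form)."""
    return [1.0 / (abs(w) + eps) for w in weights]


def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink w toward zero by t, clipping at zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def proximal_step(weights, grads, lr, lam, penalties):
    """One gradient step on the loss, followed by the weighted soft-threshold."""
    return [
        soft_threshold(w - lr * g, lr * lam * a)
        for w, g, a in zip(weights, grads, penalties)
    ]


# Toy example: gradients are zero so only the proximal shrinkage is visible.
weights = [0.8, -0.05, 0.3, 0.01]
grads = [0.0, 0.0, 0.0, 0.0]

penalties = reweighted_penalties(weights)
pruned = proximal_step(weights, grads, lr=0.1, lam=0.05, penalties=penalties)

# The nonzero pattern can be frozen as a mask and reused in later training.
mask = [w != 0.0 for w in pruned]
```

In this toy run the two small-magnitude weights are set exactly to zero while the large ones shrink only slightly, which is the behavior that reweighting is meant to encourage. Freezing `mask` afterwards mirrors, in spirit only, the notion of keeping a learned sparsity pattern fixed during fine-tuning.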
