Structural Pruning of Pre-trained Language Models via Neural Architecture Search
Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau
TL;DR
The paper tackles the practical inefficiency of large pre-trained language models (PLMs) by proposing structural pruning of fine-tuned PLMs, driven by neural architecture search (NAS), to discover Pareto-optimal sub-networks that balance accuracy and resource use. It reframes pruning as a multi-objective NAS problem, leveraging a two-stage weight-sharing approach to accelerate search and employing binary masks over attention heads and neurons. Four search spaces are explored, along with a benchmarking suite and extensive ablations comparing standard NAS and weight-sharing NAS methods. The results show that weight-sharing NAS can recover sub-networks with competitive performance while dramatically reducing search cost, enabling automated compression after fine-tuning with minimal performance loss and improved deployability. The work provides practical guidelines for choosing search spaces and training strategies and points to future directions such as instruction tuning and more sophisticated sampling toward the Pareto front.
Abstract
Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, and generalization performance. We also show how we can utilize more recently developed two-stage weight-sharing NAS approaches in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto optimal set of sub-networks, allowing for a more flexible and automated compression process.
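
To make the notion of a Pareto optimal set concrete, the following is a minimal sketch of filtering candidate sub-networks down to the non-dominated ones when two objectives are minimized. The function name pareto_front and the (validation error, parameter count in millions) pairs are hypothetical and purely illustrative; they are not results or code from the paper.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points; both objectives are minimized.

    A point is dominated if another point is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for i, (err_i, size_i) in enumerate(points):
        dominated = any(
            err_j <= err_i
            and size_j <= size_i
            and (err_j < err_i or size_j < size_i)
            for j, (err_j, size_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical candidates: (validation error, parameters in millions).
candidates = [
    (0.10, 110.0),  # unpruned-sized network, lowest error
    (0.12, 80.0),
    (0.15, 85.0),   # worse than the 80M candidate in both objectives
    (0.20, 40.0),
    (0.11, 110.0),  # same size as the first candidate but higher error
]
front = pareto_front(candidates)
print(front)
```

Every candidate that survives the filter represents a different trade-off between error and size, so the choice of operating point can be deferred to deployment time instead of being fixed in advance by a pruning threshold.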
