Structural Pruning of Pre-trained Language Models via Neural Architecture Search

Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau

TL;DR

The paper tackles the practical inefficiency of large pre-trained language models by proposing NAS-driven structural pruning of fine-tuned PLMs to discover Pareto-optimal sub-networks that balance accuracy and resource use. It reframes pruning as a multi-objective NAS problem, leveraging a two-stage weight-sharing approach to accelerate search and employing binary masks over attention heads and neurons. Four search spaces are explored, along with a benchmarking suite and extensive ablations comparing standard NAS and weight-sharing NAS methods. The results show that weight-sharing NAS can recover sub-networks with competitive performance while dramatically reducing search cost, enabling automated compression after fine-tuning with minimal performance loss and improved deployability. The work provides practical guidelines for choosing search spaces and training strategies and points to future directions such as instruction tuning and more sophisticated sampling toward the Pareto front.

Abstract

Pre-trained language models (PLMs), such as BERT or RoBERTa, mark the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, and generalization performance. We also show how we can utilize more recently developed two-stage weight-sharing NAS approaches in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto optimal set of sub-networks, allowing for a more flexible and automated compression process.
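
To make the idea of a Pareto optimal set concrete, the following is a minimal sketch, not the authors' implementation. It assumes each candidate sub-network is summarized by two objectives that are both minimized (for instance parameter count and validation error); the function name `pareto_front` and the example numbers are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (objective 1, objective 2), both minimized


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first objective."""
    front = []
    for i, p in enumerate(points):
        # p is dominated if some other point is no worse in both
        # objectives and strictly better in at least one of them.
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical candidates: (parameters in millions, validation error).
candidates = [(110.0, 0.08), (60.0, 0.10), (60.0, 0.15), (30.0, 0.20), (80.0, 0.12)]
front = pareto_front(candidates)
print(front)
```

With a fixed threshold only one of these candidates would be returned; the Pareto set instead keeps every sub-network for which no other candidate is both smaller and more accurate, so the final choice can be made afterwards.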

Paper Structure

This paper contains 19 sections, 2 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of our approach. a) We fine-tune the pre-trained architecture by updating only sub-networks, which we select by placing a binary mask over heads and units in each MHA and FFN layer. b) Afterwards, we run a multi-objective search to select the optimal set of sub-networks that balance parameter count and validation error.
  • Figure 2: Examples of head masks ${\bm{M}}_{head}$ sampled uniformly at random from different search spaces. Dark color indicates that the corresponding head is masked. The same pattern can be observed for ${\bm{M}}_{neuron}$.
  • Figure 3: Distribution of the parameter count $f_1({\bm{\theta}})$ for uniformly sampled ${\bm{\theta}} \sim \Theta$.
  • Figure 4: Example of computing the hypervolume $HV(P_f | \mathbf{r})$, corresponding to the sum of the rectangles, with respect to a reference point $\mathbf{r}$ and a set of points $P_f = \{\mathbf{y_0}, \mathbf{y_1}, \mathbf{y_2}, \mathbf{y_3}\}$.
  • Figure 5: Comparison of the four different search spaces using weight-sharing-based NAS. We sample 100 sub-networks uniformly at random using the fine-tuned weights of the super-network. The SMALL search space dominates the other search spaces except for the COLA dataset. While SMALL is a subset of MEDIUM and LARGE, these spaces are too high-dimensional to be explored with a sensible compute budget. The first two rows show results for BERT-base-cased and the last two rows for RoBERTa-base.
  • ...and 4 more figures
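
The hypervolume described in the Figure 4 caption can be illustrated with a small two-objective sketch. This is an assumption-laden example rather than the paper's code: it assumes both objectives are minimized and that the reference point $\mathbf{r}$ is worse than the points of interest in both objectives; the function name `hypervolume_2d` and the sample points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (objective 1, objective 2), both minimized


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded by the reference point `ref`."""
    # Only points strictly better than the reference point contribute.
    inside = [p for p in points if p[0] < ref[0] and p[1] < ref[1]]
    total = 0.0
    best_f2 = ref[1]  # lowest second objective seen so far in the sweep
    for f1, f2 in sorted(inside):  # sweep in increasing first objective
        if f2 < best_f2:
            # New rectangle: spans from f1 to ref[0] horizontally and
            # from f2 up to the previous best second objective.
            total += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
        # Otherwise the point is dominated and adds no area.
    return total


# Hypothetical example: three non-dominated points and one dominated point.
hv = hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)], ref=(4.0, 4.0))
print(hv)
```

Each non-dominated point contributes one rectangle, so the result is the sum of rectangles as in the caption; dominated points leave the value unchanged, which is why a larger hypervolume indicates a better approximation of the Pareto front.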