Breaking through Deterministic Barriers: Randomized Pruning Mask Generation and Selection

Jianwei Li; Weizhi Gao; Qi Lei; Dongkuan Xu

TL;DR

This work addresses the tendency of deterministic magnitude-based pruning to underperform at high sparsity for large language models. It introduces a principled, mildly randomized pruning framework with three components: randomized mask generation, a Mask Candidate Selection Strategy (MCSS), and an Early Mask Evaluation Pipeline (EMEP) that efficiently identifies beneficial sparse architectures. Empirically, the method achieves state-of-the-art results on eight GLUE tasks with BERT-based models at $16\times$ compression and shows $2$–$4\%$ gains at extreme sparsity ($100\times$), outperforming several baselines including IMP and distillation-based approaches. The approach balances exploration and efficiency, which is of practical benefit when deploying sparse Transformers. The authors note scalability challenges for truly billion-parameter models and outline paths for parallelization and future improvement.

Abstract

It is widely acknowledged that large and sparse models have higher accuracy than small and dense models under the same model size constraints. This motivates us to train a large model and then remove its redundant neurons or weights by pruning. Most existing works prune networks in a deterministic way, so their performance depends solely on a single pruning criterion and thus lacks variety. Instead, in this paper, we propose a model pruning strategy that first generates several pruning masks in a designed random manner. Subsequently, using an effective mask-selection rule, the optimal mask is chosen from the pool of mask candidates. To further enhance efficiency, we introduce an early mask evaluation strategy, mitigating the overhead associated with training multiple masks. Our extensive experiments demonstrate that this approach achieves state-of-the-art performance across eight datasets from GLUE, particularly excelling at high levels of sparsity.
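The generate-then-select idea in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration rather than the paper's definition: the names `sample_mask`, `select_mask`, and `retained_magnitude` are hypothetical, the keep probability is assumed to be proportional to $|w|^{1/T}$ for a temperature $T$ that controls randomness, and the scoring function is a stand-in for the paper's selection rule and early evaluation, which are not reproduced here. Sampling without replacement uses the Gumbel top-k trick.

```python
import math
import random


def sample_mask(weights, sparsity, temperature, rng):
    """Return a 0/1 keep-mask over `weights`.

    Assumption: kept weights are drawn without replacement with probability
    proportional to |w| ** (1 / temperature). A temperature of 0 falls back
    to deterministic magnitude pruning.
    """
    n = len(weights)
    n_keep = n - int(round(sparsity * n))
    if temperature == 0:
        keys = [abs(w) for w in weights]
    else:
        keys = []
        for w in weights:
            log_p = math.log(abs(w)) / temperature if w != 0 else -math.inf
            # -log(Exp(1)) is Gumbel(0, 1) noise; the floor avoids log(0).
            gumbel = -math.log(max(rng.expovariate(1.0), 1e-300))
            keys.append(log_p + gumbel)
    # Keeping the n_keep largest perturbed keys samples without replacement.
    keep = sorted(range(n), key=lambda i: keys[i], reverse=True)[:n_keep]
    mask = [0] * n
    for i in keep:
        mask[i] = 1
    return mask


def select_mask(candidates, score_fn):
    """Pick the highest-scoring candidate (placeholder selection rule)."""
    return max(candidates, key=score_fn)


weights = [0.9, -0.05, 0.4, 0.03, -0.6, 0.045, 0.2, -0.01]
rng = random.Random(0)

# Pool of randomized mask candidates at 50% sparsity.
candidates = [sample_mask(weights, 0.5, 0.5, rng) for _ in range(4)]


def retained_magnitude(mask):
    """Hypothetical proxy score: total weight magnitude the mask retains."""
    return sum(abs(w) * m for w, m in zip(weights, mask))


best = select_mask(candidates, retained_magnitude)
magnitude_mask = sample_mask(weights, 0.5, 0, rng)  # deterministic baseline
```

In a real pipeline the proxy score would be replaced by a measurement on the pruned model, for example a brief fine-tuning run per candidate. The sketch only shows how a pool of randomized masks and a selection step fit together.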

Paper Structure (42 sections, 3 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Question 1.
Question 2.
Preliminaries
Pruning
Iterative Magnitude Pruning
Knowledge Distillation
Multinomial Distribution
Methodology
Rethink Iterative Magnitude Pruning
Randomized Pruning Mask Generation
Mask Sampling
Controllable Randomness
Accelerated Mask Sampling
Randomized Pruning Mask Selection
...and 27 more sections

Figures (8)

Figure 1: Weight Distribution in a Feedforward Layer of BERT$_{Base}$ at Various Sparsity Levels (0.52 and 0.83), Corresponding to Pruning Thresholds $\tau=0.027$ and $\tau=0.055$. Notably, around 29% of the weights lie within the range $[\frac{2}{3}\tau, \frac{4}{3}\tau]$. This observation calls into question the efficacy of magnitude-based pruning, as these weights sit close to the threshold yet might play a crucial role in maintaining the model's accuracy. It suggests that directly eliminating the weights with the smallest magnitudes could lead to a suboptimal pruning strategy.
Figure 2: Main Architecture of Our Strategy. We replace the deterministic mask generation in IMP with our randomized method. Specifically, we first introduce a degree of randomness into the process of mask generation in a principled way, then we employ a specific mask selection rule, paired with an efficient evaluation pipeline, to distinguish the optimal mask from a pool of candidates.
Figure 3: Comparing the Impact of Randomness in Two Different Schedules with a Deterministic Approach (IMP), which features zero randomness. The horizontal axis shows the logarithm of $ir$, with larger $ir$ indicating a greater amount of total introduced randomness. The vertical axis shows the model's accuracy.
Figure 4: Mask Sampling (a) vs. Mask Sampling + MCSS (b). Note that the green line in (a) and (b) represents the same value of accuracy from IMP. The value on the horizontal axis represents the amount of introduced randomness. The value on the vertical axis indicates model accuracy.
Figure 5: Impact of Sparsities (a) and Impact of the Number of Mask Candidates in MCSS (b). The horizontal axes represent the sparsity level and the number of mask candidates for panels (a) and (b) respectively, while the vertical axes in both panels denote model accuracy.
...and 3 more figures
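
The near-threshold statistic quoted in the Figure 1 caption can be computed for any weight vector along the following lines. This is a minimal sketch under stated assumptions: the helper names `pruning_threshold` and `fraction_near_threshold` are hypothetical, the threshold $\tau$ is assumed to be the largest magnitude among the pruned weights at the given sparsity, and the toy weights are illustrative rather than taken from BERT.

```python
def pruning_threshold(weights, sparsity):
    """Magnitude of the largest weight removed at the given sparsity.

    Assumption: the k smallest-magnitude weights are pruned, where
    k = round(sparsity * n), and tau is the k-th smallest magnitude.
    """
    magnitudes = sorted(abs(w) for w in weights)
    k = int(round(sparsity * len(magnitudes)))
    if k == 0:
        raise ValueError("sparsity too low: no weight would be pruned")
    return magnitudes[k - 1]


def fraction_near_threshold(weights, tau):
    """Share of weights whose magnitude lies in [2/3 * tau, 4/3 * tau]."""
    low, high = 2.0 * tau / 3.0, 4.0 * tau / 3.0
    near = sum(1 for w in weights if low <= abs(w) <= high)
    return near / len(weights)


# Toy weights with magnitudes 0.1, 0.2, ..., 1.0 and alternating signs.
toy_weights = [(i / 10) * (-1) ** i for i in range(1, 11)]
tau = pruning_threshold(toy_weights, 0.5)
share = fraction_near_threshold(toy_weights, tau)
```

A large share means many weights fall on either side of the cut by a small margin, which is the situation the caption argues a purely deterministic threshold handles poorly.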
