Breaking through Deterministic Barriers: Randomized Pruning Mask Generation and Selection
Jianwei Li, Weizhi Gao, Qi Lei, Dongkuan Xu
TL;DR
Deterministic magnitude-based pruning often underperforms at high sparsity for large language models. This work introduces a principled, mildly randomized pruning framework with three components: randomized mask generation, a Mask Candidate Selection Strategy (MCSS), and an Early Mask Evaluation Pipeline (EMEP), which together identify beneficial sparse architectures efficiently. Empirically, the method achieves state-of-the-art results on eight GLUE tasks with BERT-based models at $16\times$ compression and shows $2$–$4\%$ gains at extreme sparsity ($100\times$), outperforming several baselines, including IMP and distillation-based approaches. The approach balances exploration and efficiency, offering practical benefits for deploying sparse Transformers. The authors note scalability challenges for truly billion-parameter models and outline paths for parallelization and future improvement.
Abstract
It is widely acknowledged that large and sparse models have higher accuracy than small and dense models under the same model size constraints. This motivates us to train a large model and then remove its redundant neurons or weights by pruning. Most existing works prune networks deterministically, so their performance depends solely on a single pruning criterion and thus lacks variety. In this paper, we instead propose a model pruning strategy that first generates several pruning masks in a designed random way. An effective mask-selection rule then chooses the optimal mask from the pool of mask candidates. To further enhance efficiency, we introduce an early mask evaluation strategy that mitigates the overhead of training multiple masks. Our extensive experiments demonstrate that this approach achieves state-of-the-art performance across eight datasets from GLUE, particularly excelling at high levels of sparsity.
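The generate-then-select idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's MCSS or EMEP, since the abstract does not specify the sampling distribution or the selection rule. The sketch assumes that randomness is injected by adding Gumbel noise to log-magnitudes before taking the top-k, and that candidates are ranked by a user-supplied loss. The names `random_mask`, `select_mask`, `temperature`, `n_candidates`, and `toy_loss` are hypothetical.

```python
import numpy as np


def random_mask(weights, sparsity, temperature, rng):
    """Sample a boolean keep-mask that favors large-magnitude weights.

    Assumption (not from the paper): log-magnitudes are perturbed with
    Gumbel noise and the top-scoring entries are kept. With
    temperature == 0 this reduces to deterministic magnitude pruning.
    """
    flat = np.abs(weights).ravel()
    n_keep = int(round(flat.size * (1.0 - sparsity)))
    mask = np.zeros(flat.size, dtype=bool)
    if n_keep == 0:
        return mask.reshape(weights.shape)
    noise = rng.gumbel(size=flat.size)
    scores = np.log(flat + 1e-12) + temperature * noise
    # Indices of the n_keep largest scores (order within the group is arbitrary).
    keep = np.argpartition(-scores, n_keep - 1)[:n_keep]
    mask[keep] = True
    return mask.reshape(weights.shape)


def select_mask(weights, sparsity, evaluate, n_candidates=8,
                temperature=0.5, seed=0):
    """Generate several candidate masks and return the lowest-loss one.

    `evaluate` is a hypothetical stand-in for a cheap, early evaluation
    of a pruned model; it maps pruned weights to a scalar loss.
    """
    rng = np.random.default_rng(seed)
    candidates = [random_mask(weights, sparsity, temperature, rng)
                  for _ in range(n_candidates)]
    losses = [evaluate(weights * m) for m in candidates]
    best = int(np.argmin(losses))
    return candidates[best], losses[best]


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    W = data_rng.normal(size=(16, 8))   # toy linear layer
    X = data_rng.normal(size=(64, 16))  # toy inputs

    def toy_loss(pruned):
        # Output reconstruction error of the pruned layer on X.
        return float(np.mean((X @ W - X @ pruned) ** 2))

    mask, loss = select_mask(W, sparsity=0.75, evaluate=toy_loss)
    print("kept fraction:", mask.mean(), "loss:", loss)
```

In this sketch, `temperature` controls how far the candidates stray from pure magnitude pruning. A small value keeps the randomness mild, in the spirit of the approach described above. In a real setting, `evaluate` would involve some training of the pruned model rather than a closed-form reconstruction error.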
