LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models
Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Zhengfei Chen, Graziano Chesi, Ngai Wong, Hao Yu
TL;DR
The paper tackles inefficient post-training pruning of large language models by addressing both inter-block error propagation and the static nature of pruning masks. It introduces LLM-Barber, a one-shot, block-aware sparsity mask rebuilder that uses the product of weights and gradients as a pruning metric to identify salient weights across Self-Attention and MLP blocks, enabling global optimization without retraining. The method demonstrates state-of-the-art perplexity and zero-shot performance on LLaMA and OPT families at 50% sparsity, and integrates smoothly with structured N:M sparsity and hardware-aware quantization (e.g., APTQ) for accelerated deployment on GPUs and FPGAs. These findings highlight a practical pathway for fast, high-accuracy, deployment-ready pruning of massive LLMs without expensive retraining cycles, broadening accessibility of large models in resource-constrained settings.
Abstract
Large language models (LLMs) have seen substantial growth, necessitating efficient model pruning techniques. Existing post-training pruning methods primarily measure weight importance in converged dense models, often overlooking changes in weight significance during the pruning process, which leads to performance degradation. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, facilitating global performance optimization. We are the first to employ the product of weights and gradients as a pruning metric in the context of LLM post-training pruning. This enables accurate identification of weight importance in massive models and significantly reduces computational complexity compared to methods using second-order information. Our experiments show that LLM-Barber efficiently prunes models from LLaMA and OPT families (7B to 13B) on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.
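To make the idea of rebuilding a sparsity mask with a weight-times-gradient metric concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that `weights` holds the dense weight values of one layer, that `grads` holds gradients of a block-level error with respect to those weights, and that the rebuild step swaps pruned and retained positions pairwise. The names `importance`, `rebuild_mask`, and `max_swaps`, and the swap rule itself, are hypothetical choices for illustration.

```python
def importance(weights, grads):
    """Score each weight by |w * g|, a first-order saliency estimate."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def rebuild_mask(weights, grads, mask, max_swaps):
    """Return a new mask with the same sparsity as `mask`.

    Assumed rule (illustrative only): pair the highest-scoring pruned
    weights with the lowest-scoring retained weights, and swap a pair
    only while the pruned weight scores strictly higher.
    """
    scores = importance(weights, grads)
    # Pruned positions, most important first.
    pruned = sorted((i for i, m in enumerate(mask) if not m),
                    key=lambda i: scores[i], reverse=True)
    # Retained positions, least important first.
    kept = sorted((i for i, m in enumerate(mask) if m),
                  key=lambda i: scores[i])
    new_mask = list(mask)
    for p, k in zip(pruned[:max_swaps], kept[:max_swaps]):
        if scores[p] <= scores[k]:
            break  # later pairs can only be less favourable
        new_mask[p], new_mask[k] = True, False
    return new_mask


# Toy layer with six weights; the initial mask keeps the three largest magnitudes.
weights = [0.9, -0.1, 0.5, 0.05, -0.7, 0.3]
grads = [0.1, 2.0, 0.2, 0.1, 0.3, 0.1]
mask = [True, False, True, False, True, False]

new_mask = rebuild_mask(weights, grads, mask, max_swaps=2)
print(new_mask)
```

In this toy case the small weight at index 1 has a large gradient, so its score exceeds that of the large weight at index 0 and the two trade places, while the overall sparsity stays at 50%. Because the metric needs only one gradient per weight, it avoids the Hessian-style statistics that second-order criteria require.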
