LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Zhengfei Chen, Graziano Chesi, Ngai Wong, Hao Yu

TL;DR

The paper tackles inefficient post-training pruning of large language models by addressing both inter-block error propagation and the static nature of pruning masks. It introduces LLM-Barber, a one-shot, block-aware sparsity mask rebuilder that uses the product of weights and gradients as a pruning metric to identify salient weights across Self-Attention and MLP blocks, enabling global optimization without retraining. The method demonstrates state-of-the-art perplexity and zero-shot performance on LLaMA and OPT families at 50% sparsity, and integrates smoothly with structured N:M sparsity and hardware-aware quantization (e.g., APTQ) for accelerated deployment on GPUs and FPGAs. These findings highlight a practical pathway for fast, high-accuracy, deployment-ready pruning of massive LLMs without expensive retraining cycles, broadening accessibility of large models in resource-constrained settings.

Abstract

Large language models (LLMs) have seen substantial growth, necessitating efficient model pruning techniques. Existing post-training pruning methods primarily measure weight importance in converged dense models, often overlooking changes in weight significance during the pruning process, leading to performance degradation. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, facilitating global performance optimization. We are the first to employ the product of weights and gradients as a pruning metric in the context of LLM post-training pruning. This enables accurate identification of weight importance in massive models and significantly reduces computational complexity compared to methods using second-order information. Our experiments show that LLM-Barber efficiently prunes models from LLaMA and OPT families (7B to 13B) on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.
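
To make the idea of a weight-times-gradient metric and mask rebuilding concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`importance`, `rebuild_mask`), the `max_swaps` parameter, and the pairing rule (highest-scoring pruned weight against lowest-scoring kept weight) are assumptions made for illustration. It also assumes that a gradient is available for every weight position, including currently pruned ones, and that scores are computed from the dense weight values.

```python
# Illustrative sketch only; all names and the pairing rule are hypothetical.

def importance(weights, grads):
    """First-order importance score per weight: |w * g|."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def rebuild_mask(mask, scores, max_swaps):
    """Swap up to `max_swaps` (pruned, kept) pairs whenever the pruned
    weight scores higher than the kept one. Each swap revives one weight
    and prunes another, so the overall sparsity stays the same."""
    # Pruned positions, most important first.
    pruned = sorted((i for i, keep in enumerate(mask) if not keep),
                    key=lambda i: scores[i], reverse=True)
    # Kept positions, least important first.
    kept = sorted((i for i, keep in enumerate(mask) if keep),
                  key=lambda i: scores[i])
    new_mask = list(mask)
    for p, k in list(zip(pruned, kept))[:max_swaps]:
        if scores[p] <= scores[k]:
            break  # remaining pairs cannot improve the mask
        new_mask[p], new_mask[k] = True, False
    return new_mask


weights = [0.9, -0.1, 0.5, 0.05]
grads = [0.01, 2.0, 0.1, 0.2]
initial_mask = [True, False, True, False]  # e.g. from magnitude pruning at 50%

scores = importance(weights, grads)
new_mask = rebuild_mask(initial_mask, scores, max_swaps=1)
print(new_mask)
```

In this toy case the second weight is small in magnitude but has a large gradient, so it is revived, while the first weight, large but with a near-zero gradient, is pruned in its place. Only first-order quantities are used, which is the sense in which such a metric avoids second-order (Hessian) computation.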

Paper Structure (16 sections, 11 equations, 4 figures, 9 tables, 1 algorithm)

Figures (4)

  • Figure 1: The benefits of integrating LLM-Barber into the pruning process: (a) Transition from layer-aware to block-aware error accumulation to achieve an optimized global solution. (b) Rebuilding the sparsity mask using a novel pruning metric based on weights multiplied by gradients.
  • Figure 2: The workflow of LLM-Barber. (a) illustrates the process of block-aware reconstruction error and gradient calculation for each linear weight. (b) shows pruning metric computation and sparsity mask rebuilding.
  • Figure 3: The importance score distribution of mask rebuilding pairs and WikiText-2 perplexity results at varying pruning granularities in LLaMA-7B, with the green dashed line marking the optimal mask rebuilding ratio.
  • Figure 4: Ablation of calibration size in LLaMA3-8B. LLM-Barber is robust across varying calibration sizes.