BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

TL;DR

BESA tackles the high cost of large language models by introducing blockwise sparsity allocation with a differentiable blockwise reconstruction loss and a parameter-efficient sparsity learning mechanism. By pruning per transformer block and learning row- or layer-wise pruning rates via differentiable binary masks, BESA allocates sparsity where it matters most, reducing performance degradation compared to layer-wise pruning. The approach supports joint optimization with 4-bit weight-only quantization (OmniQuant) and demonstrates state-of-the-art perplexity and zero-shot performance across LLaMA and LLaMA2 model sizes, with practical hardware speedups on the ViTCoD accelerator. These results indicate that blockwise, differentiable sparsity combined with quantization can yield substantial compression and speedups for extremely large models while running the pruning itself on a single GPU.

Abstract

Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss. In contrast to the typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which ensure reduced performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs like LLaMA1 and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at https://github.com/OpenGVLab/LLMPrune-BESA.
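
The idea of measuring pruning error at the level of a whole block, rather than per layer, can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the BESA code base: the "block" is a toy stack of two linear layers with a ReLU, the masks come from plain per-row magnitude pruning (not the importance metric used by Wanda or BESA), and the helper names `magnitude_mask` and `block_recon_loss` are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def magnitude_mask(W, sparsity):
    """Binary mask that zeroes the smallest-magnitude fraction of each row.
    Magnitude pruning is a stand-in chosen for simplicity (assumption)."""
    mask = []
    for row in W:
        k = int(round(sparsity * len(row)))  # number of weights to prune in this row
        order = sorted(range(len(row)), key=lambda j: abs(row[j]))
        pruned = set(order[:k])
        mask.append([0.0 if j in pruned else 1.0 for j in range(len(row))])
    return mask

def apply_mask(W, M):
    return [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(W, M)]

def block_forward(W1, W2, x):
    """Toy block (assumption): two linear layers with a ReLU in between."""
    return matvec(W2, relu(matvec(W1, x)))

def block_recon_loss(W1, W2, M1, M2, calib):
    """Squared error between dense and pruned block outputs,
    summed over output units and averaged over calibration samples."""
    W1s, W2s = apply_mask(W1, M1), apply_mask(W2, M2)
    total = 0.0
    for x in calib:
        dense = block_forward(W1, W2, x)
        sparse = block_forward(W1s, W2s, x)
        total += sum((d - s) ** 2 for d, s in zip(dense, sparse))
    return total / len(calib)

random.seed(0)
n = 8
W1 = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
W2 = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
calib = [[random.gauss(0, 1) for _ in range(n)] for _ in range(16)]

# Three allocations with the same average sparsity (50%) across the two layers.
for s1, s2 in [(0.5, 0.5), (0.25, 0.75), (0.75, 0.25)]:
    loss = block_recon_loss(W1, W2, magnitude_mask(W1, s1), magnitude_mask(W2, s2), calib)
    print(f"layer sparsities ({s1}, {s2}) -> block loss {loss:.3f}")
```

Because the loss is taken on the block output, the three allocations can be compared directly even though they share the same average sparsity, which is the kind of comparison a per-layer sparsity search relies on.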

Paper Structure

This paper contains 16 sections, 7 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Layer-wise pruning methods such as Wanda (Sun et al., 2023) produce a larger error than our block-wise pruning technique, BESA. (b) Perplexity vs. sparsity curves for different layers on WikiText2 (Merity et al., 2016); layers do not contribute equally to the final performance. (c) Prior works prune all linear projections in the transformer block by layer-wise reconstruction. (d) Our proposed BESA compresses LLMs under a block-wise reconstruction pipeline.
  • Figure 2: The BESA pipeline. (a) BESA prunes weights in the self-attention and feed-forward networks by block reconstruction, which enables an efficient, differentiable search for layer-specific pruning rates. (b) Weights are pruned with differentiable binary masks, which are obtained in a parameter-efficient way by modeling the importance of each weight. Note that only a small number of ratios $\{\beta_d\}_{d=1}^D$ are learnable during pruning, while the original weights in the LLM are frozen.
  • Figure 3: Model sparsity ablation
  • Figure 4: Calibration size ablation
  • Figure 5: Reconstruction error for learning granularities.
  • ...and 2 more figures
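
The Figure 2 caption mentions that only a small set of ratios $\{\beta_d\}_{d=1}^D$ is learned per layer. The sketch below shows one way such ratios could determine a layer's sparsity in the forward direction. It rests on assumptions that the caption does not state: the ratios are a softmax over free logits, the candidate pruning rates are $p_d = d/D$, and the mask is a hard threshold on an importance ranking. The differentiable relaxation that lets gradients reach the ratios is not implemented here, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Normalize free logits into ratios beta_d that sum to 1 (assumed parameterization)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_sparsity(beta):
    """Layer sparsity as sum_d beta_d * p_d, with assumed candidate rates p_d = d / D."""
    D = len(beta)
    return sum(b * (d / D) for d, b in enumerate(beta, start=1))

def hard_mask(importance, sparsity):
    """Keep-mask (1 = keep, 0 = prune) that drops the least-important fraction."""
    n = len(importance)
    k = int(round(sparsity * n))  # number of weights to prune
    order = sorted(range(n), key=lambda j: importance[j])
    pruned = set(order[:k])
    return [0 if j in pruned else 1 for j in range(n)]

# D = 4 learnable logits describe the sparsity of a layer with any number of weights.
beta = softmax([0.0, 0.0, 0.0, 0.0])
alpha = expected_sparsity(beta)
mask = hard_mask([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4], alpha)
print(alpha, mask)
```

The point of the sketch is the parameter count: the layer's mask is controlled by D numbers rather than by one learnable value per weight, while the weights themselves stay fixed.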