LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

Haihang Wu

TL;DR

This work tackles the high computational cost of large language models by introducing LLM-BIP, a block-wise structured pruning method that targets FFN channels and MSA heads within transformer blocks. It derives a Lipschitz-based upper bound to compute block-wise importance scores, enabling pruning decisions in a single forward pass without relying on unreliable gradients or Lipschitz assumptions for self-attention. The approach improves post-pruning accuracy and yields practical speedups, outperforming strong baselines on multiple models and zero-shot tasks, particularly at higher sparsities after limited fine-tuning. Overall, LLM-BIP offers a scalable, hardware-friendly pruning strategy that lowers perplexity and raises accuracy relative to those baselines.

Abstract

Large language models (LLMs) have demonstrated remarkable performance across various language tasks, but their widespread deployment is impeded by their large size and high computational costs. Structural pruning is a prevailing technique used to introduce sparsity into pre-trained models and facilitate direct hardware acceleration during inference by removing redundant connections (structurally-grouped parameters), such as channels and attention heads. Existing structural pruning approaches often employ either global or layer-wise pruning criteria; however, they are hindered by ineffectiveness stemming from inaccurate evaluation of connection importance. Global pruning methods typically assess component importance using near-zero and unreliable gradients, while layer-wise pruning approaches encounter significant pruning error accumulation issues. To this end, we propose a more accurate pruning metric based on the block-wise importance score propagation, termed LLM-BIP. Specifically, LLM-BIP precisely evaluates connection importance by gauging its influence on the respective transformer block output, which can be efficiently approximated in a single forward pass through an upper bound derived from the assumption of Lipschitz continuity. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks. The results demonstrate that our approach achieves an average of 3.26% increase in accuracy for common reasoning tasks compared to previous best baselines. It also reduces perplexity by 14.09 and 68.76 on average for the WikiText2 dataset and PTB dataset, respectively.
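
As a rough illustration of why a forward-only importance score is possible, the sketch below scores the channels of an FFN's final linear projection using only activations and weights. It is not the paper's LLM-BIP formula, which is not reproduced in this summary. It is a simplified stand-in that assumes the channel feeds a single linear projection whose output is the block's FFN contribution. The function names `ffn_channel_scores` and `channels_to_prune`, the weight layout, and the random calibration data are all hypothetical.

```python
import numpy as np


def ffn_channel_scores(hidden: np.ndarray, w_down: np.ndarray) -> np.ndarray:
    """Score each FFN channel from forward activations only.

    hidden: (tokens, d_ff) activations entering the final FFN projection.
    w_down: (d_model, d_ff) weight of that projection (assumed layout).
    Zeroing channel j changes the projection output by exactly
    ||hidden[:, j]|| * ||w_down[:, j]|| in Frobenius norm.
    """
    act_norm = np.linalg.norm(hidden, axis=0)  # per-channel activation norm
    col_norm = np.linalg.norm(w_down, axis=0)  # per-channel weight column norm
    return act_norm * col_norm


def channels_to_prune(scores: np.ndarray, sparsity: float) -> np.ndarray:
    """Return indices of the lowest-scoring channels for a target sparsity."""
    n_prune = int(sparsity * scores.shape[0])
    return np.argsort(scores)[:n_prune]


# Hypothetical calibration activations and weights (random, for illustration).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 8))
w_down = rng.normal(size=(4, 8))

scores = ffn_channel_scores(hidden, w_down)
pruned = channels_to_prune(scores, sparsity=0.25)

# Compare the true output change with the summed per-channel scores.
masked = hidden.copy()
masked[:, pruned] = 0.0
error = np.linalg.norm(hidden @ w_down.T - masked @ w_down.T)
print(error, scores[pruned].sum())
```

By the triangle inequality, the summed scores of the removed channels upper-bound the actual change in the projection output, so no backward pass is needed to rank channels in this toy setting. Extending such a bound through nonlinear parts of a transformer block is where a Lipschitz-continuity assumption, as described in the abstract, would come in.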

Paper Structure

This paper contains 12 sections, 4 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of global pruning, layer-wise pruning, and our block-wise pruning. Global pruning methods (Ma et al., 2023) aim to minimize the effect of pruning on the final model output, typically relying on memory-intensive and unreliable gradients. Layer-wise pruning techniques (Sun et al., 2024) minimize the pruning error on the output of the current layer. Although efficient, they suffer from rapid accumulation of pruning error. In contrast, our block-wise pruning strategy minimizes the pruning impact on the output of the transformer block, avoiding unreliable gradients and mitigating error accumulation.
  • Figure 2: Comparison of the proposed method, Wanda, and LLM-Pruner on LLaMA-7B (left) and Vicuna-7B (right) at different pruning rates without fine-tuning.
  • Figure 3: Our method mitigates the pruning error accumulation issue compared to Wanda. Pruning error is measured as the mean of the distance defined in the optimization objective equation, computed for each transformer block output. The LLaMA-13B model is pruned to 20% sparsity without fine-tuning.
  • Figure 4: Our method is robust to the calibration set size. The calibration set consists of sentences with a sequence length of 2048 from C4. The LLaMA-7B model is pruned to 20% sparsity without fine-tuning.