SlimGPT: Layer-wise Structured Pruning for Large Language Models

Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu

TL;DR

SlimGPT introduces a fast, low-cost approach for structured pruning of large language models by extending the Optimal Brain Surgeon framework. It adds Batched Greedy Pruning to efficiently estimate head-wise and FFN pruning errors via grouped Cholesky decompositions and applies Incremental Pruning Ratio to mitigate layer-wise error accumulation. Empirical results on LLaMA variants demonstrate state-of-the-art performance at 20–50% pruning, with substantial reductions in memory and latency and robust performance under limited calibration data. The work offers a practical path toward deployable, efficient LLMs, while noting limitations at very high pruning ratios and the importance of calibration data choice.
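
As a rough illustration of the Optimal Brain Surgeon step that the method extends, the sketch below removes one input column of a weight matrix and adjusts the remaining columns to compensate. It assumes a layer-wise reconstruction objective with Hessian $H = XX^\top$ built from calibration inputs $X$. The function name `obs_prune_column` and the toy shapes are hypothetical, and the grouped Cholesky batching of Batched Greedy Pruning is not reproduced here.

```python
import numpy as np

def obs_prune_column(W, H, q):
    """Zero input column q of W and compensate the other columns (OBS-style).

    W: (d_out, d_in) weight matrix.
    H: (d_in, d_in) Hessian, assumed here to be X @ X.T and invertible.
    Returns the updated weights and the resulting squared output error.
    """
    H_inv = np.linalg.inv(H)
    # Error of removing column q, summed over all output rows.
    err = np.sum(W[:, q] ** 2) / H_inv[q, q]
    # Compensation: shift the remaining weights along row q of H^{-1}.
    W_new = W - np.outer(W[:, q] / H_inv[q, q], H_inv[q, :])
    W_new[:, q] = 0.0  # exact zero, removing floating-point residue
    return W_new, err

# Toy usage with random calibration data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 64))   # 6 input features, 64 calibration tokens
W = rng.standard_normal((4, 6))
W_pruned, error = obs_prune_column(W, X @ X.T, q=2)
```

Under these assumptions the returned error equals the squared Frobenius norm of the change in the layer output, which is what makes it usable as a score for choosing which column (or group of columns, such as an attention head) to remove.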

Abstract

Large language models (LLMs) have garnered significant attention for their remarkable capabilities across various domains, but their vast parameter scales present challenges for practical deployment. Structured pruning is an effective way to balance model performance with efficiency, but restoring performance under computational resource constraints is a principal challenge in pruning LLMs. We therefore present SlimGPT, a low-cost and fast structured pruning method for LLMs based on the Optimal Brain Surgeon framework. We propose Batched Greedy Pruning for rapid and near-optimal pruning, which improves the accuracy of head-wise pruning error estimation through grouped Cholesky decomposition and improves the efficiency of FFN pruning via Dynamic Group Size, thereby achieving approximately locally optimal pruning results within one hour. In addition, we examine the limitations of layer-wise pruning from the perspective of error accumulation and propose Incremental Pruning Ratio, a non-uniform pruning strategy that reduces performance degradation. Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.
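
To make the idea of a non-uniform, layer-wise pruning ratio concrete, here is a minimal sketch of one possible schedule. It assumes a linear ramp from shallow to deep layers and equally sized layers, so that the mean per-layer ratio equals the overall target. The linear shape, the `spread` parameter, and the function name `incremental_ratios` are illustrative assumptions and are not taken from the paper.

```python
def incremental_ratios(num_layers, target, spread=0.5):
    """Per-layer pruning ratios that rise linearly with depth.

    The ratios run from target * (1 - spread) at the first layer to
    target * (1 + spread) at the last, so their mean equals `target`.
    """
    if not 0.0 <= spread <= 1.0:
        raise ValueError("spread must lie in [0, 1]")
    high = target * (1.0 + spread)
    if not 0.0 <= target <= 1.0 or high > 1.0:
        raise ValueError("ratios would fall outside [0, 1]")
    if num_layers == 1:
        return [target]
    low = target * (1.0 - spread)
    step = (high - low) / (num_layers - 1)
    return [low + i * step for i in range(num_layers)]

# Example: 32 layers with an overall 50% pruning target.
ratios = incremental_ratios(32, 0.5)
```

Pruning shallow layers less than deep ones reflects the error-accumulation argument: an error introduced early propagates through every later layer, whereas an error introduced late has fewer layers to pass through.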

Paper Structure

This paper contains 28 sections, 9 equations, 5 figures, 15 tables, 1 algorithm.

Figures (5)

  • Figure 1: Batched Greedy Pruning on attention blocks, where $W$ is an output matrix and $H$ is the corresponding Hessian. Different colors represent distinct attention heads, and gray indicates the pruned weights.
  • Figure 2: Per-layer FFN output error between the original LLaMA-7B and three distinct pruned models. The pruned models apply a first-layer reduction of 25%, 50%, and 75%, respectively. The PPL of the original model is 12.63. For ease of visualization, the layer index has been truncated to 25.
  • Figure 3: Effects of Calibration Sample Size & Sequence Length.
  • Figure 4: Layer-wise pruning ratio on LLaMA-7B with total pruning ratio 50%.
  • Figure 5: Alpaca train loss & Wikitext2 evaluation loss.