One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao; Bei Liu; Bo Xiao; Ke Zeng; Guanglu Wan; Yanmin Qian

TL;DR

The paper addresses the difficulty of deploying very large LLMs by enabling high-sparsity, one-shot pruning without retraining. It combines an Improved Saliency Criterion (ISC) with Hessian sensitivity-aware mixed sparsity: the Hutchinson estimator approximates the Hessian trace, and the resulting sensitivity is used to allocate per-weight or per-layer sparsity within a global budget. Key contributions are the ISC formulation, a practical sparsity-allocation algorithm guided by Hessian sensitivity, and state-of-the-art one-shot pruning performance at 50% sparsity or higher, together with quantization compatibility and a public code release. The work advances practical LLM compression, enabling reductions in model size and latency while limiting accuracy loss, particularly at very high sparsity and when combined with quantization for further deployment efficiency.
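The Hutchinson estimator mentioned above can be illustrated with a minimal sketch. It estimates tr(H) as the average of z^T H z over random probe vectors z with entries of +1 or -1, so only Hessian-vector products are needed and H is never formed explicitly. The function name `hutchinson_trace`, the callable `hvp`, and the toy matrix `H` are assumptions made for illustration; they are not taken from the paper or its released code, and a real LLM setup would obtain the Hessian-vector product from automatic differentiation.

```python
import random


def hutchinson_trace(hvp, dim, num_samples=200, seed=0):
    """Estimate tr(H) using only Hessian-vector products hvp(z) = H z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        # z^T H z is an unbiased single-sample estimate of tr(H).
        total += sum(zi * hi for zi, hi in zip(z, hz))
    return total / num_samples


# Toy symmetric matrix standing in for a Hessian (hypothetical values).
H = [
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 0.5],
    [0.0, 0.5, 2.0],
]


def toy_hvp(z):
    # Explicit matrix-vector product; used here only because H is tiny.
    return [sum(row[j] * z[j] for j in range(len(z))) for row in H]


estimate = hutchinson_trace(toy_hvp, dim=3, num_samples=2000)
exact = sum(H[i][i] for i in range(3))
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

Dividing the estimate by `dim` gives an average trace, which the section titles suggest is the quantity the paper uses as a sensitivity score; that reading is an assumption based on the outline alone.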

Abstract

Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, their enormous size has hindered practical use in real-world applications because of high inference latency. Improving the efficiency of LLMs through quantization, pruning, and other means has therefore been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning that prunes LLMs to at least 50% sparsity without any retraining. It allocates sparsity adaptively based on sensitivity, which reduces pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more pronounced when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the code.
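The abstract does not state the allocation rule, so the following is only an illustrative sketch of the general idea: give less sensitive layers more sparsity and more sensitive layers less, while keeping the parameter-weighted average at the global target. The rank-based linear ramp, the `spread` parameter, and the function name `allocate_sparsity` are hypothetical choices made here, not the authors' algorithm.

```python
def allocate_sparsity(sensitivities, sizes, target, spread=0.1):
    """Assign per-layer sparsity ratios whose size-weighted mean equals target.

    Illustrative rule only: a linear ramp over sensitivity ranks.
    """
    n = len(sensitivities)
    # Rank layers from least sensitive (rank 0) to most sensitive (rank n-1).
    order = sorted(range(n), key=lambda i: sensitivities[i])
    offsets = [0.0] * n
    for rank, i in enumerate(order):
        # Ramp from +spread (least sensitive) down to -spread (most sensitive).
        frac = rank / (n - 1) if n > 1 else 0.5
        offsets[i] = spread * (1.0 - 2.0 * frac)
    # Re-center so the parameter-weighted mean offset is zero.
    total = sum(sizes)
    mean_offset = sum(o * s for o, s in zip(offsets, sizes)) / total
    ratios = [target + o - mean_offset for o in offsets]
    if any(r < 0.0 or r > 1.0 for r in ratios):
        raise ValueError("spread is too large for this target sparsity")
    return ratios


# Hypothetical sensitivities and parameter counts for three layers.
sens = [0.9, 0.2, 0.5]
sizes = [100, 100, 200]
ratios = allocate_sparsity(sens, sizes, target=0.5, spread=0.1)
overall = sum(r * s for r, s in zip(ratios, sizes)) / sum(sizes)
print(ratios, overall)
```

The re-centering step is what keeps the global budget fixed when layers differ in size; without it, moving sparsity between a small and a large layer would shift the overall ratio away from the target.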

Paper Structure (18 sections, 6 equations, 3 figures, 4 tables, 1 algorithm)

Introduction
Methodology
Mask Selection and Weight Reconstruction Based on OBS
Improved Saliency Criterion
Mixed Sparsity Pruning Based on Hessian Sensitivity Awareness
Calculation of Sensitivity Based on Hessian Average Trace
Rational Mixed Sparsity Assignment Based on Sensitivity
Experiments
Experiment Setup
Model, Dataset and Evaluation
Setup
Experiment Results and Analysis
Evaluation on Improved Saliency Criterion
Evaluation on Mixed Sparsity Pruning
Evaluation on Very High Sparsity Models
...and 3 more sections

Figures (3)

Figure 1: Sensitivity level of different layers for three LLM models: LLaMA-7B, LLaMA2-7B, and Baichuan-7B.
Figure 2: Sensitivity level of different weight components in LLaMA-7B.
Figure 3: Perplexity of different methods with varying sparsity levels evaluated on PTB.
