One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian
TL;DR
The paper tackles the challenge of deploying very large LLMs by enabling high-sparsity, one-shot pruning without retraining. It introduces an Improved Saliency Criterion (ISC) combined with Hessian sensitivity-aware mixed sparsity, using the Hutchinson estimator to approximate the Hessian trace and allocate per-weight or per-layer sparsity within a global budget. Key contributions include the ISC formulation, a practical sparsity-allocation algorithm guided by Hessian sensitivity, state-of-the-art one-shot pruning performance at 50% sparsity, compatibility with quantization, and a public code release. The work advances practical LLM compression by reducing model size and latency while maintaining accuracy, with the largest gains at very high sparsity and when pruning is combined with quantization.
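As a rough illustration of the Hutchinson estimator mentioned above, the sketch below approximates the trace of a symmetric matrix using only matrix-vector products with random Rademacher (plus or minus one) probes. It is not the authors' implementation: the names `hutchinson_trace` and `make_matvec` are hypothetical, and the small explicit matrix `H` stands in for a Hessian that, in a real LLM, would only be reachable through Hessian-vector products computed by automatic differentiation.

```python
import random


def hutchinson_trace(matvec, dim, num_samples=2000, seed=0):
    """Estimate trace(H) as the average of v^T H v over Rademacher vectors v.

    `matvec` is any callable returning H @ v for a list v of length `dim`.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = matvec(v)
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / num_samples


def make_matvec(matrix):
    """Wrap an explicit matrix (list of rows) as a matrix-vector product."""
    def matvec(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return matvec


# Toy symmetric matrix standing in for a Hessian (an assumption for the demo).
H = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]

estimate = hutchinson_trace(make_matvec(H), dim=3)
exact = sum(H[i][i] for i in range(3))
```

The estimate is unbiased because the expectation of v v^T is the identity for Rademacher probes, so only the diagonal terms survive on average. The off-diagonal entries contribute variance, which is why many samples are averaged.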
Abstract
Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, their enormous model sizes have hindered practical use in real-world applications due to high inference latency. Improving the efficiency of LLMs through quantization, pruning, and other means has therefore been a key issue in LLM research. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more pronounced when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the code.
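To make the idea of sensitivity-based allocation under a fixed overall sparsity concrete, here is a minimal sketch. It is an assumption for illustration, not the allocation rule from the paper: layers are ranked by a sensitivity score (for example an estimated Hessian trace), the least sensitive layer receives `target + spread`, the most sensitive receives `target - spread`, and the values in between are spaced linearly. The function name `allocate_sparsity`, the `spread` parameter, and the equal-layer-size assumption are all hypothetical.

```python
def allocate_sparsity(sensitivities, target=0.5, spread=0.1):
    """Assign a sparsity ratio to each layer from its sensitivity rank.

    Less sensitive layers are pruned more, more sensitive layers less.
    Assuming all layers have the same number of weights, the mean of the
    returned ratios equals `target`, so the global budget is preserved.
    """
    n = len(sensitivities)
    if n == 1:
        return [target]
    # Layer indices sorted from least to most sensitive.
    order = sorted(range(n), key=lambda i: sensitivities[i])
    sparsities = [0.0] * n
    for rank, idx in enumerate(order):
        frac = rank / (n - 1)  # 0.0 for least sensitive, 1.0 for most
        sparsities[idx] = target + spread * (1.0 - 2.0 * frac)
    return sparsities


# Hypothetical sensitivity scores for three layers.
scores = [10.0, 1.0, 5.0]
ratios = allocate_sparsity(scores, target=0.5, spread=0.1)
mean_ratio = sum(ratios) / len(ratios)
```

Because the offsets are symmetric around zero, they cancel in the average. With layers of unequal size, the offsets would need to be weighted by parameter count to keep the overall sparsity at the target.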
