Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity

Chi Xu, Gefei Zhang, Yantong Zhu, Luca Benini, Guosheng Hu, Yawei Li, Zhihong Zhang

TL;DR

The paper tackles pruning of LLMs at extreme sparsity. It measures how sensitive each layer is using the trace of the Fisher Information Matrix, then searches for per-layer sparsity levels with a pruning-oriented evolutionary algorithm. MSP initializes the search population from these layer sensitivities and outputs a per-layer sparsity allocation that can be plugged into existing pruning methods, improving their performance at high sparsity. Across LLaMA, LLaMA-2, and OPT, MSP yields substantial perplexity reductions and zero-shot accuracy gains at $75\%$ sparsity. Because the framework is modular, layer-aware pruning can replace uniform sparsity baselines in existing pipelines, with better downstream results on language modeling and reasoning tasks.
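The pipeline summarized above can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the empirical-Fisher estimate (mean squared gradient norm), the inverse-trace initialization rule, the mean-preserving `project` step, and especially `proxy_loss`, which stands in for the real fitness (a real run would presumably prune the model and evaluate it, e.g. by perplexity on calibration data). All function names are hypothetical.

```python
import random

def fisher_trace(per_sample_grads):
    """Empirical Fisher trace for one layer: mean over samples of ||grad||^2."""
    total = sum(sum(g * g for g in grad) for grad in per_sample_grads)
    return total / len(per_sample_grads)

def project(sparsity, target, lo=0.0, hi=0.95):
    """Clip each layer to [lo, hi] and shift so the mean stays at the target."""
    s = list(sparsity)
    for _ in range(50):
        s = [min(hi, max(lo, x)) for x in s]
        gap = target - sum(s) / len(s)
        if abs(gap) < 1e-9:
            break
        s = [x + gap for x in s]
    return [min(hi, max(lo, x)) for x in s]

def init_sparsity(traces, target):
    """Assumed heuristic: sparsity inversely proportional to the layer's trace,
    so more sensitive layers (larger trace) are pruned less."""
    inv = [1.0 / t for t in traces]
    mean_inv = sum(inv) / len(inv)
    return project([target * v / mean_inv for v in inv], target)

def proxy_loss(sparsity, traces):
    """Hypothetical fitness: sum_l trace_l * s_l^2 (NOT the paper's objective)."""
    return sum(t * s * s for t, s in zip(traces, sparsity))

def mutate(sparsity, target, rate, rng, step=0.05):
    """Perturb some layers, then re-project onto the sparsity budget."""
    child = [s + rng.uniform(-step, step) if rng.random() < rate else s
             for s in sparsity]
    return project(child, target)

def evolve(traces, target, pop_size=16, generations=100, rate=0.3, seed=0):
    """Elitist evolutionary search seeded with the sensitivity-informed start."""
    rng = random.Random(seed)
    start = init_sparsity(traces, target)
    pop = [start] + [mutate(start, target, 1.0, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda s: proxy_loss(s, traces))
        parents = pop[: pop_size // 2]          # keep the better half unchanged
        children = [mutate(rng.choice(parents), target, rate, rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return min(pop, key=lambda s: proxy_loss(s, traces))
```

Calling `evolve([4.0, 1.0, 1.0, 0.25], target=0.75)` returns one sparsity value per layer whose mean stays at the 75% budget, with the highest-trace layer pruned least. The design point the sketch tries to capture is that the search only reallocates a fixed budget across layers, which is what lets the result be handed to any pruning method that accepts a per-layer sparsity.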

Abstract

N:M structured pruning is essential for large language models (LLMs) because it can remove less important network weights and reduce memory and computation requirements. Existing pruning methods mainly focus on designing metrics that measure the importance of network components to guide pruning. Apart from the impact of these metrics, we observe that layers differ in how strongly they affect network performance. Thus, we propose an efficient method based on the trace of the Fisher Information Matrix (FIM) to quantitatively measure and verify these differing sensitivities across layers. Based on this, we propose Mixed Sparsity Pruning (MSP), which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers. To guarantee fast convergence and achieve promising performance, we use an efficient FIM-inspired layer-wise sensitivity to initialize the population of the EA. In addition, MSP can work as a plug-and-play module, ready to be integrated into existing pruning methods. Extensive experiments with LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate the superior performance of our method. In particular, at extreme pruning ratios (e.g., 75%), our method outperforms existing methods in terms of perplexity (PPL) by orders of magnitude (Figure 1).
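For readers unfamiliar with the N:M pattern mentioned in the abstract, here is a minimal sketch. It assumes the common convention that N weights are kept in every group of M consecutive weights, and it uses plain weight magnitude as the importance score; magnitude is a stand-in chosen for brevity, not the metric of any particular method that MSP plugs into. The function names `nm_prune` and `sparsity` are hypothetical.

```python
def nm_prune(weights, n, m):
    """Keep the n largest-magnitude entries in each group of m consecutive
    weights and zero the rest (assumed convention: n kept out of every m)."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)
```

Under this convention, 75% sparsity corresponds to patterns such as 1:4 or 2:8, while the widely used 2:4 pattern gives 50%. A per-layer sparsity allocation would then amount to choosing a different pattern for each layer.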

Paper Structure

This paper contains 26 sections, 10 equations, 8 figures, 8 tables, 5 algorithms.

Figures (8)

  • Figure 1: Perplexity of the LLaMA-13B model pruned by different methods, with and without our MSP, at 75% sparsity on the WikiText dataset (lower is better).
  • Figure 2: 1-D loss landscape for different layers of LLaMA-2-7B on the C4 dataset (Raffel et al., 2020). The landscape is plotted by perturbing model weights along the trace of the Fisher Information Matrix of each layer, with a magnitude of $\epsilon$ (i.e., $\epsilon=0$ corresponds to no perturbation).
  • Figure 3: The trace of the Hessian matrix of LLaMA-13B.
  • Figure 4: Comparison between random initialization and sensitivity-informed initialization on LLaMA-7B.
  • Figure 5: Ablation study on mutation rates (left) and population sizes (right) with LLaMA-7B model.
  • ...and 3 more figures