Towards Efficient Automatic Self-Pruning of Large Language Models

Weizhong Huang; Yuxin Zhang; Xiawu Zheng; Fei Chao; Rongrong Ji

TL;DR

This work tackles the deployment burden of large language models by proposing Self-Pruner, an end-to-end framework that lets LLMs autonomously run an evolutionary search to determine layer-wise pruning rates under a fixed average pruning constraint $\frac{1}{n}\sum_{i=1}^n p_i=\beta$. By encoding population generation, selection, crossover, and mutation as prompts for LLMs, the method leverages intrinsic redundancy knowledge to efficiently explore pruning configurations and evaluate them via perplexity on WikiText-2, targeting minimal accuracy loss. Empirical results on LLaMA-1/2/3 and Vicuna show Self-Pruner outperforms prior post-training pruning methods, achieving substantial inference-speed gains (e.g., up to $1.82\times$) while maintaining competitive zero-shot performance across seven tasks, and demonstrating notable gains especially for larger models like LLaMA-2-70B. Although fully automated compression remains an open goal, the approach significantly reduces human intervention in pruning design and highlights the viability of LLM-driven optimization for practical model compression.
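As a small illustration of the average-rate constraint (the numbers here are hypothetical and not taken from the paper), take $n=4$ layers and $\beta=0.5$. A non-uniform assignment such as $p=(0.3,\,0.4,\,0.6,\,0.7)$ is feasible because

$$\frac{1}{4}\sum_{i=1}^{4} p_i=\frac{0.3+0.4+0.6+0.7}{4}=\frac{2.0}{4}=0.5=\beta.$$

Under this reading, the search only redistributes pruning across layers while the overall budget stays fixed; uniform pruning, $p_i=\beta$ for every $i$, is one feasible point among many.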

Abstract

Despite their exceptional capabilities, Large Language Models (LLMs) still face deployment challenges due to their enormous size. Post-training structured pruning is a promising solution: it prunes LLMs without retraining, reduces computational overhead, and is hardware-deployment friendly. However, the training-free nature of post-training structured pruning leads to significant performance degradation. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. Meanwhile, we find that LLMs may have prior knowledge about their own redundancy. Based on this insight, we introduce $\textbf{Self-Pruner}$, an end-to-end automatic self-pruning framework for LLMs that efficiently searches for layer-wise pruning rates. Specifically, $\textbf{Self-Pruner}$ leverages LLMs to autonomously execute the entire evolutionary search over pruning rate configurations. In this process, LLMs generate populations, select parent solutions from the current population, and perform crossover and mutation operations to produce offspring solutions. In this way, LLMs automatically generate and evaluate a large number of candidate solutions, effectively converging on well-performing pruning rate configurations with minimal human intervention. Extensive experiments demonstrate that $\textbf{Self-Pruner}$ outperforms existing state-of-the-art methods. Notably, $\textbf{Self-Pruner}$ prunes LLaMA-2-70B to the 49B level with only a 0.80$\%$ drop in accuracy across seven commonsense reasoning tasks, achieving a 1.39$\times$ speedup on an NVIDIA A100 80GB GPU. Further pruning to the 35B level results in only a 3.80$\%$ decrease in accuracy while obtaining a 1.70$\times$ speedup.
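The search loop described above (population generation, parent selection, crossover, mutation) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: in Self-Pruner each operator is carried out by prompting an LLM, whereas here plain random operators stand in for those prompts. The names `repair`, `toy_fitness`, and `evolve`, the rate bounds `[0.0, 0.9]`, and all hyperparameters are hypothetical, and `toy_fitness` is a made-up stand-in for evaluating a pruned model (lower is better).

```python
import random


def repair(rates, beta, lo=0.0, hi=0.9):
    """Shift every rate by one constant (found by bisection) and clip to
    [lo, hi] so that the mean rate equals beta. Bounds are an assumption."""
    if not lo <= beta <= hi:
        raise ValueError("beta must lie inside [lo, hi]")

    def shifted(c):
        return [min(hi, max(lo, r + c)) for r in rates]

    # At c_lo every rate clips to lo; at c_hi every rate clips to hi.
    c_lo, c_hi = lo - max(rates), hi - min(rates)
    for _ in range(60):
        mid = (c_lo + c_hi) / 2
        if sum(shifted(mid)) / len(rates) < beta:
            c_lo = mid
        else:
            c_hi = mid
    return shifted(c_hi)


def toy_fitness(rates):
    """Hypothetical cost: pretends earlier layers are more sensitive.
    A real run would score the pruned model instead."""
    n = len(rates)
    return sum((n - i) * r * r for i, r in enumerate(rates))


def evolve(n_layers, beta, fitness, pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    # Population generation: random configurations repaired to mean beta.
    pop = [repair([rng.random() for _ in range(n_layers)], beta)
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half as parents (lower cost is better).
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: take each layer's rate from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Mutation: perturb some rates with small Gaussian noise.
            child = [r + rng.gauss(0, 0.05) if rng.random() < 0.2 else r
                     for r in child]
            # Re-impose the average-rate constraint after variation.
            children.append(repair(child, beta))
        pop = parents + children
    return min(pop, key=fitness)


best = evolve(n_layers=8, beta=0.5, fitness=toy_fitness)
print([round(r, 3) for r in best], round(sum(best) / len(best), 6))
```

The `repair` step is one simple way to keep every candidate on the fixed average-rate budget after crossover and mutation. How the paper enforces the constraint inside its prompts is not stated in this section, so that part in particular should be read as an assumption.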
