Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models

Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang

TL;DR

The paper tackles the challenge of deploying large language models by eliminating the need for expert-crafted pruning algorithms. It introduces AutoPrune, a paradigm where LLMs autonomously design pruning strategies, guided by Graph-driven Chain-of-Thought (GCoT) and Skew-aware Dynamic Sparsity Allocation (SDSA) to address black-box reasoning and the outlier value issue caused by uniform sparsity. Through extensive experiments on WikiText and zero-shot tasks across multiple LLaMA variants, AutoPrune achieves state-of-the-art performance among training-free pruning methods and demonstrates strong generalization to unseen architectures, with robustness under aggressive sparsity and without weight updates. The work significantly reduces labor costs, enhances scalability, and lays groundwork for automated, adaptive sparsity design applicable to broader model families, including multimodal extensions in the future.

Abstract

Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, which incurs huge labor costs and requires expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind the dramatic performance degradation under high pruning ratios, which is caused by uniform sparsity; this raises an additional concern about how to design adaptive pruning sparsity suited to LLMs. Can LLMs prune by themselves? In this work, we answer in the affirmative by proposing a novel pruning method called AutoPrune, which is the first to overcome expert knowledge limits by leveraging LLMs to automatically design optimal pruning algorithms for themselves without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
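
For context, the kind of hand-designed, training-free metric the abstract refers to can be sketched as follows. This is a minimal illustration in the style of Wanda, which scores each weight by its magnitude times the L2 norm of the matching input feature and removes the lowest-scoring weights within each output row at one uniform ratio. It is not code from the paper: the function name `wanda_style_prune`, the list-of-lists layout, and the toy numbers are assumptions made for the example.

```python
import math


def wanda_style_prune(weights, activations, sparsity):
    """Prune one linear layer with a Wanda-style score (illustrative sketch).

    weights:     list of rows, one row per output neuron, one column per input feature
    activations: list of calibration input vectors (same length as a weight row)
    sparsity:    fraction of weights to zero out in every row (uniform ratio)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature, accumulated over the calibration samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    k = int(n_in * sparsity)  # number of weights removed per output row

    pruned = []
    for row in weights:
        # Score = |weight| * norm of the input feature it multiplies.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Indices of the k lowest-scoring weights in this row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy layer: 2 output neurons, 4 input features; feature 2 has large activations.
toy_weights = [[0.5, -2.0, 0.1, 1.0],
               [3.0, 0.2, -0.4, 0.05]]
toy_activations = [[1.0, 0.1, 10.0, 1.0],
                   [1.0, 0.1, 10.0, 1.0]]
toy_pruned = wanda_style_prune(toy_weights, toy_activations, sparsity=0.5)
```

In the toy layer the small weight 0.1 in the first row survives because its input feature carries large activations, while the larger weight -2.0 is removed. Every row and every layer here receives the same ratio, which is the uniform-sparsity setting the abstract contrasts with adaptive allocation.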

Paper Structure

This paper contains 28 sections, 8 equations, 10 figures, 17 tables.

Figures (10)

  • Figure 1: (a) AutoPrune vs. (b) Manual Design. Manual design requires expert knowledge, resulting in huge labor costs. In contrast, our AutoPrune can efficiently design several specialized pruning algorithms by leveraging LLMs.
  • Figure 2: Our AutoPrune vs. peer competitors on 7 zero-shot tasks. (a) LLaMA-1 7B. (b) LLaMA-2 7B. (c) LLaMA-2 13B.
  • Figure 3: Validation on outlier value issue by skewness. Top: per-layer skewness. Bottom: mean skewness.
  • Figure 4: Validation of layer sensitivity to pruning ratios.
  • Figure 5: SDSA vs. Uniform allocation at 70% sparsity.
  • ...and 5 more figures
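
The figure captions refer to per-layer skewness and to a comparison of SDSA against uniform allocation at 70% sparsity. The sketch below shows how skewness can be computed and how a skew-dependent allocation could keep the average sparsity at a target value. The skewness function is the standard moment-based definition. The allocation rule is purely an assumption made for illustration: this summary does not give the SDSA formula, so the direction of the adjustment (more skewed layers pruned less), the `spread` parameter, and the function names are all hypothetical.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0.0 for constant input)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def allocate_sparsity(layer_skews, target, spread=0.1):
    """Hypothetical skew-aware allocation (NOT the paper's SDSA formula).

    Assumption: layers whose skewness is above average are treated as more
    outlier-heavy and pruned less; layers below average are pruned more.
    Each ratio stays within target +/- spread and the mean equals target.
    """
    if not (0.0 <= target - spread and target + spread <= 1.0):
        raise ValueError("target +/- spread must stay within [0, 1]")
    mean_skew = sum(layer_skews) / len(layer_skews)
    max_dev = max(abs(s - mean_skew) for s in layer_skews)
    if max_dev == 0:
        # All layers look alike: fall back to uniform allocation.
        return [target] * len(layer_skews)
    # Deviations from the mean sum to zero, so the average ratio is preserved.
    return [target - spread * (s - mean_skew) / max_dev for s in layer_skews]


# Toy per-layer skewness values (made up for the example), 70% target sparsity.
toy_skews = [0.5, 1.0, 3.0]
toy_ratios = allocate_sparsity(toy_skews, target=0.7)
```

With these toy values the most skewed layer is assigned the lowest ratio (0.6) while the average over layers stays at 0.7, so the overall budget matches a uniform 70% allocation.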