Adaptive Pruning for Large Language Models with Structural Importance Awareness

Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han

TL;DR

The paper tackles the challenge of deploying large language models on resource-constrained devices by introducing SAAP, a structurally-aware adaptive pruning framework. SAAP has three components: an adaptive importance fusion metric that combines coarse- and fine-grained scores under a homoscedastic uncertainty model, an adaptive structure search guided by an importance fluctuation indicator, and a group-wise fine-tuning strategy that quantizes and updates weights efficiently. On the LLaMA and Vicuna families at modest pruning ratios, SAAP achieves accuracy gains of 2.17% to 2.39% (on LLaMA-7B, Vicuna-7B, and LLaMA-13B) and roughly 5% faster token generation, outperforming several baselines and generalizing across multiple model families. The approach addresses three key pruning challenges (multi-metric importance estimation, non-uniform layer-wise pruning, and memory-efficient fine-tuning), offering a practical path to edge-ready LLMs with preserved performance. These findings suggest SAAP enables scalable, efficient deployment of large models in real-world, resource-limited settings.

Abstract

The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. To address this issue, we propose a novel LLM model pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and LLaMA-13B. Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios.
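
As a rough illustration of the idea of fusing importance scores under homoscedastic uncertainty, the sketch below weights a coarse-grained and a fine-grained score by inverse variances and then ranks layers by the fused value. The paper's actual metric is not given in this summary, so the weighting form (the common 1/(2σ²) weighting with a log σ regularizer), the function names `fuse_importance` and `rank_layers`, and the fixed σ values are all assumptions made for illustration. The importance fluctuation indicator used by SAAP's structure search is not reproduced here.

```python
import math


def fuse_importance(coarse, fine, sigma_coarse=1.0, sigma_fine=1.0):
    """Fuse two importance scores with uncertainty-based weights.

    Assumed form (not taken from the paper): each score is scaled by
    1 / (2 * sigma^2), and log(sigma) terms act as a regularizer.
    """
    w_coarse = 1.0 / (2.0 * sigma_coarse ** 2)
    w_fine = 1.0 / (2.0 * sigma_fine ** 2)
    regularizer = math.log(sigma_coarse) + math.log(sigma_fine)
    return w_coarse * coarse + w_fine * fine + regularizer


def rank_layers(coarse_scores, fine_scores, sigma_coarse=1.0, sigma_fine=1.0):
    """Return layer indices sorted from least to most important (fused score)."""
    fused = [
        fuse_importance(c, f, sigma_coarse, sigma_fine)
        for c, f in zip(coarse_scores, fine_scores)
    ]
    return sorted(range(len(fused)), key=lambda i: fused[i])


# Hypothetical per-layer scores for a three-layer toy model.
coarse = [3.0, 1.0, 2.0]
fine = [3.0, 1.0, 2.0]
order = rank_layers(coarse, fine)  # least important layer comes first
```

In practice the σ values would be learned or estimated rather than fixed. A larger σ down-weights a score that is considered less reliable.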

Paper Structure

This paper contains 19 sections, 11 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: The pipeline of existing LLM pruning methods.
  • Figure 2: An overview of the SAAP method. Given a foundation LLM, SAAP first removes the most volatile structure by adaptive importance assessment. Then, it restores the performance of the pruned model through efficient group-wise fine-tuning.
  • Figure 3: Average of adaptive importance fusion metrics of each layer in different LLMs.
  • Figure 4: LLM's answer under different pruning ratios.
  • Figure 5: The results of SAAP and LLM-Pruner at different pruning ratios. (a) and (b) show the results of the Vicuna-7B model on the PTB and WikiText2 datasets, respectively. (c) and (d) show the results of the LLaMA-13B model on the PTB and WikiText2 datasets, respectively.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3