CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information

Yuxin Wang, Minghua Ma, Zekun Wang, Jingchang Chen, Huiming Fan, Liping Shan, Qing Yang, Dongliang Xu, Ming Liu, Bing Qin

TL;DR

CFSP is an activation-driven, coarse-to-fine structured pruning framework for LLMs that prunes the FFN intermediate dimension while preserving hardware efficiency. It computes block-level saliency from angular activation distances and fine-grained scores within each block, allocates sparsity non-uniformly across blocks, and adjusts the pruned dimensions to multiples of 128. An efficient IG-LoRA recovery scheme allocates per-block trainable capacity based on coarse-grained importance, improving performance with limited recovery data. Empirical results across multiple LLaMA and Qwen models show that CFSP outperforms baselines on zero-shot tasks and language modeling while delivering speed-ups of about 1.5x and substantial parameter reductions at high sparsity. These findings suggest that CFSP is a practical, hardware-friendly pruning approach for deploying large language models in real-world settings.

Abstract

The colossal parameters and computational overhead of Large Language Models (LLMs) challenge their real-world applications. Network pruning, which targets unstructured or structured sparsity by removing redundant parameters, has recently been explored for LLM acceleration. Existing LLM pruning works focus on unstructured pruning, which typically requires special hardware support for a practical speed-up. In contrast, structured pruning can reduce latency on general devices. However, it remains a challenge to perform structured pruning efficiently and maintain performance, especially at high sparsity ratios. To this end, we introduce an efficient structured pruning framework named CFSP, which leverages both Coarse (interblock) and Fine-grained (intrablock) activation information as an importance criterion to guide pruning. The pruning is highly efficient, as it only requires one forward pass to compute feature activations. Specifically, we first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block. In addition, we introduce a recovery fine-tuning strategy that adaptively allocates training overhead based on coarse-grained importance to further improve performance. Experimental results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets. Our code will be available at https://github.com/wyxscir/CFSP.
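
The coarse-grained step described above (score each block, spread the sparsity budget unevenly, keep widths hardware-friendly) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that block importance is the angular distance between a block's input and output hidden states, that a block's kept ratio is proportional to its importance, and that widths are rounded to the nearest multiple of 128. The function names, the proportional rule, and the example width of 11008 are all hypothetical choices made for illustration.

```python
import math

def angular_distance(x, y):
    """Angular distance in [0, 1]: arccos of the cosine similarity, divided by pi."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi

def block_importance(block_inputs, block_outputs):
    """Assumed coarse score: mean angular distance between each token's
    input and output hidden state for one block."""
    dists = [angular_distance(x, y) for x, y in zip(block_inputs, block_outputs)]
    return sum(dists) / len(dists)

def allocate_sparsity(importance, target_sparsity):
    """Hypothetical allocation rule: the kept ratio of a block is proportional
    to its importance, so the mean kept ratio equals 1 - target_sparsity
    (before clipping at 1.0)."""
    mean_imp = sum(importance) / len(importance)
    keep = [min(1.0, (1.0 - target_sparsity) * imp / mean_imp) for imp in importance]
    return [1.0 - k for k in keep]

def kept_width(width, sparsity, multiple=128):
    """Round the kept FFN intermediate width to the nearest multiple of
    `multiple`, keeping at least one multiple."""
    units = int(width * (1.0 - sparsity) / multiple + 0.5)
    return max(multiple, units * multiple)

# Illustrative importance scores for three blocks (made-up values).
importance = [0.25, 0.5, 0.75]
sparsity = allocate_sparsity(importance, target_sparsity=0.5)
widths = [kept_width(11008, s) for s in sparsity]
```

Under these assumptions, blocks that change their hidden states less receive a larger share of the pruning budget, and every resulting width stays a multiple of 128.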

Paper Structure (49 sections, 6 equations, 9 figures, 13 tables)

This paper contains 49 sections, 6 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Illustration of our proposed CFSP framework. (a) Pruning with coarse (interblock) to fine (intrablock) activation information guidance. (b) Recovery fine-tuning with importance-guided allocation, where the rank sizes of each component are determined by coarse-grained importance.
  • Figure 2: Preliminary analysis. (Left): Parameter size and MACs of modules. (Right): Sensitivity of pruning each module on LLaMA2-7B.
  • Figure 3: The structural dependencies of FFN in LLaMA3. The blue part corresponds to the minimum unit of structured pruning. The red box represents the relative size of a matrix element in its row or column.
  • Figure 4: Results of different recovery fine-tuning methods at different data sizes. $r = 8/32$ means the average rank budget configuration is set to 8 or 32.
  • Figure 5: Performance comparison of MMLU task between CFSP and Wanda-SP on LLaMA3-8B with various sparsity.
  • ...and 4 more figures