SlimLLM: Accurate Structured Pruning for Large Language Models

Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang

TL;DR

SlimLLM tackles the high computational cost of large language models by proposing a fast, structured pruning framework that treats sub-modules holistically. It uses Pearson-based similarity to assess MHA head importance, PCA-based feature-space scores for FFN channels, and a lightweight linear regression step to recover performance, with layer-wise pruning ratios guided by layer input-output cosine similarity. The method demonstrates strong zero-shot performance retention on LLaMA models at 20% pruning, substantial latency reductions, and robust ablation results showing each component’s contribution. These contributions offer a practical, hardware-friendly approach to compressing LLMs without large-scale retraining, enabling more accessible deployment in resource-constrained settings.

Abstract

Large language models (LLMs) have garnered significant attention and demonstrated impressive capabilities in a wide range of applications. However, their enormous computational cost often severely limits their deployment and application. Structured pruning is an effective way to address this issue by compressing the parameters of LLMs. Determining the importance of each sub-module and minimizing performance loss are the critical issues that structured pruning must address carefully. In this paper, we propose an effective and fast structured pruning method for large language models, named SlimLLM. For channel and attention head pruning, we evaluate importance based on the entire channel or head, rather than merely aggregating the importance of individual elements within a sub-module. This approach enables a more holistic consideration of the interdependence among elements within the sub-module. In addition, we design a simple linear regression strategy for the output matrix to quickly recover performance. We also propose a layer-based importance ratio to determine the pruning ratio for each layer. On the LLaMA benchmark, SlimLLM outperforms other methods and achieves state-of-the-art performance.

Paper Structure

This paper contains 16 sections, 11 equations, 2 figures, 9 tables, 1 algorithm.

Figures (2)

  • Figure 1: The overall framework of our proposed SlimLLM. $o_{-h_{i}}$ denotes the output excluding the $i$-th head. For the MHA sub-layer, we use the Pearson similarity between $o_{-h_{i}}$ and the sum of all heads' outputs $o$ to evaluate the importance of each head, and prune the heads whose removal yields higher similarity. For the FFN sub-layer, we map the down matrix to the feature space of the output activation and calculate channel importance from the eigenvalues corresponding to the eigenvectors. Finally, we apply linear regression to fine-tune the output matrix of each sub-layer.
  • Figure 2: Mean value of the coefficients $A$ in MHA and FFN for each layer of LLaMA-7B.
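
The head-importance idea in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `pearson` and `head_similarity_scores` are hypothetical, the data is random, and the sketch assumes that each head's contribution has already been projected by its slice of the output matrix, so that the MHA output $o$ is simply the sum of the per-head contributions. It only shows the general pattern of comparing $o_{-h_{i}}$ with $o$.

```python
import numpy as np


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two arrays, treated as flat vectors."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def head_similarity_scores(head_outputs: np.ndarray) -> np.ndarray:
    """Score each head by the similarity between o_{-h_i} and o.

    head_outputs has shape (num_heads, tokens, hidden); entry i is the
    (assumed already projected) contribution of head i to the MHA output.
    A higher score means the output barely changes without that head.
    """
    o = head_outputs.sum(axis=0)  # full MHA output
    return np.array([pearson(o - h, o) for h in head_outputs])


# Toy data: four heads, one of which contributes almost nothing.
rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 16, 8))
heads[2] *= 0.01

scores = head_similarity_scores(heads)
prune_order = np.argsort(-scores)  # most prunable head first
```

In this toy setup the near-silent head receives the highest similarity score and therefore comes first in `prune_order`, which matches the caption's rule of pruning heads with higher similarity.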