FASP: Fast and Accurate Structured Pruning of Large Language Models

Hanyu Hu, Pengxiang Zhao, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan

TL;DR

FASP tackles the resource demands of large language models by introducing a fast, accurate structured pruning approach that links sequential linear layers to prune correlated columns and rows, guided by a Wanda-inspired scoring metric. A closed-form, least-squares restoration step recovers fidelity after pruning, enabling substantial speedups without retraining. Across OPT and LLaMA models, FASP outperforms baselines in perplexity and zero-shot tasks while dramatically reducing pruning time on a single RTX 4090, making large-model deployment more practical on resource-constrained hardware.

Abstract

The rapid increase in the size of large language models (LLMs) has significantly escalated their computational and memory demands, posing challenges for efficient deployment, especially on resource-constrained devices. Structured pruning has emerged as an effective model compression method that can reduce these demands while preserving performance. In this paper, we introduce FASP (Fast and Accurate Structured Pruning), a novel structured pruning framework for LLMs that emphasizes both speed and accuracy. FASP employs a distinctive pruning structure that interlinks sequential layers, allowing for the removal of columns in one layer while simultaneously eliminating corresponding rows in the preceding layer without incurring additional performance loss. The pruning metric, inspired by Wanda, is computationally efficient and effectively selects components to prune. Additionally, we propose a restoration mechanism that enhances model fidelity by adjusting the remaining weights post-pruning. We evaluate FASP on the OPT and LLaMA model families, demonstrating superior performance in terms of perplexity and accuracy on downstream tasks compared to state-of-the-art methods. Our approach achieves significant speed-ups, pruning models such as OPT-125M in 17 seconds and LLaMA-30B in 15 minutes on a single NVIDIA RTX 4090 GPU, making it a highly practical solution for optimizing LLMs.
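The restoration idea can be illustrated with a small numerical sketch. This section does not give the paper's exact formulation, so the code below assumes an ordinary least-squares objective: given calibration activations X and a dense weight W, it refits the kept columns so that the pruned layer's output matches the dense output as closely as possible. All names (X, W, keep, W_restored) and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 256, 12, 6

# Synthetic, correlated calibration activations (assumption: real activations
# are correlated, which is what lets kept columns compensate for pruned ones).
Z = rng.standard_normal((n_tokens, 4))
X = Z @ rng.standard_normal((4, d_in)) + 0.1 * rng.standard_normal((n_tokens, d_in))
W = rng.standard_normal((d_out, d_in))

keep = np.arange(8)           # hypothetical choice: keep the first 8 input columns
X_kept = X[:, keep]
Y_dense = X @ W.T             # output of the unpruned layer

# Naive pruning: drop the columns and leave the remaining weights untouched.
W_naive = W[:, keep]

# Restoration sketch: solve min_B ||X_kept @ B - Y_dense||_F in closed form.
B, *_ = np.linalg.lstsq(X_kept, Y_dense, rcond=None)
W_restored = B.T              # shape (d_out, len(keep))

err_naive = np.linalg.norm(X_kept @ W_naive.T - Y_dense)
err_restored = np.linalg.norm(X_kept @ W_restored.T - Y_dense)
print(f"naive error: {err_naive:.3f}, restored error: {err_restored:.3f}")
```

Because the naive weights are one feasible candidate for the least-squares problem, the restored error on the calibration data can never exceed the naive error. How well the fit carries over to unseen data depends on the calibration set.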

Paper Structure (11 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Illustration of the proposed pruning structure on the OPT model. In this approach, columns of $W_{\text{fc}_2}$ are removed along with the corresponding rows of $W_{\text{fc}_1}$ without impacting performance, thanks to the inherited position mapping in matrix multiplication. The same principle applies to columns of $W_{\text{out}}$ and rows of $W_V$, as well as the rows of $W_Q$ and $W_K$.
  • Figure 2: Illustration of the modified Wanda's metric for structured pruning, which removes the columns of $W$ where the corresponding columns in $S$ have smaller column-wise sums.
  • Figure 3: Comparative analysis of sparsity versus perplexity across different methods for the OPT-1.3B and OPT-2.7B models on the WikiText dataset.
  • Figure 4: Comparative analysis of sparsity versus perplexity across different methods for the LLaMA-7B and LLaMA-13B models on the WikiText dataset.
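
The structure in Figure 1 and the column scoring in Figure 2 can be sketched together on a toy feed-forward block. The code assumes the standard Wanda score $S_{ij} = |W_{ij}| \cdot \lVert X_j \rVert_2$ summed over each column; the exact metric used by FASP may differ in detail. The layer sizes, random weights, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden = 32, 8, 16

X = rng.standard_normal((n_tokens, d_model))
W_fc1 = rng.standard_normal((d_hidden, d_model))  # each row produces one hidden unit
b_fc1 = rng.standard_normal(d_hidden)
W_fc2 = rng.standard_normal((d_model, d_hidden))  # each column consumes one hidden unit

def mlp(x, w1, b1, w2):
    """Toy feed-forward block: fc1, ReLU, fc2 (fc2 bias omitted for brevity)."""
    h = np.maximum(x @ w1.T + b1, 0.0)
    return h @ w2.T

# Input activations of fc2, used to weight its columns.
H = np.maximum(X @ W_fc1.T + b_fc1, 0.0)

# Wanda-style score matrix, then column-wise sums as in Figure 2.
S = np.abs(W_fc2) * np.linalg.norm(H, axis=0)
col_scores = S.sum(axis=0)

n_prune = 4
keep = np.sort(np.argsort(col_scores)[n_prune:])  # drop the lowest-scoring columns

# Reference: only zero out the pruned columns of fc2.
W_fc2_masked = np.zeros_like(W_fc2)
W_fc2_masked[:, keep] = W_fc2[:, keep]
y_masked = mlp(X, W_fc1, b_fc1, W_fc2_masked)

# Structured pruning: also remove the matching rows (and biases) of fc1.
y_pruned = mlp(X, W_fc1[keep], b_fc1[keep], W_fc2[:, keep])

print("max output gap:", np.abs(y_masked - y_pruned).max())
```

The two outputs agree up to floating-point rounding. Once a column of fc2 is gone, the hidden unit it consumed no longer contributes to the output, so removing the row of fc1 that produces it costs nothing further.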