AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent

Jing Liu, Toshiaki Koike-Akino, Ye Wang, Hassan Mansour, Matthew Brand

TL;DR

This paper proposes Activation-aware Weight pruning and quantization via Projected Gradient Descent (AWP), a layer-wise post-training compression method for large transformers that leverages activation statistics to improve pruning and quantization. By formulating per-row sparse regression and applying iterative hard thresholding within a PGD framework, AWP unifies pruning and quantization without costly SVDs and offers convergence guarantees for pruning. Empirical results on Llama-2 and Llama-3 models show that AWP outperforms state-of-the-art activation-aware methods across pruning and quantization tasks and benefits from joint pruning-quantization. The work also provides theoretical insights via RIP/RSC/RSM-based guarantees, suggesting promising directions for structured sparsity and quantization theory in large language model compression.

Abstract

To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.
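The pruning idea described above can be sketched as projected gradient descent on the activation-aware objective $\frac{1}{2}\|(\mathbf{W}-\mathbf{\Theta})\mathbf{C}^{\frac{1}{2}}\|_\mathrm{F}^2$, with a hard-thresholding projection applied to each row. This is a minimal illustration, not the authors' implementation: the function names, the step size $1/\lambda_{\max}(\mathbf{C})$, the magnitude-pruning initialization, and the choice of $\mathbf{C}$ as a symmetric positive semidefinite activation statistic are all assumptions.

```python
import numpy as np

def hard_threshold_rows(theta, k):
    """Keep the k largest-magnitude entries in each row and zero the rest."""
    out = np.zeros_like(theta)
    # Column indices of the k largest |entries| per row (unordered within the k).
    idx = np.argpartition(np.abs(theta), -k, axis=1)[:, -k:]
    rows = np.arange(theta.shape[0])[:, None]
    out[rows, idx] = theta[rows, idx]
    return out

def awp_prune_sketch(W, C, k, n_iter=50):
    """PGD on 0.5 * ||(W - Theta) C^{1/2}||_F^2 under a per-row k-sparsity constraint."""
    # Assumed step size: 1 / largest eigenvalue of C (eigvalsh sorts ascending).
    eta = 1.0 / np.linalg.eigvalsh(C)[-1]
    # Assumed initialization: plain magnitude pruning of W.
    theta = hard_threshold_rows(W, k)
    for _ in range(n_iter):
        grad = (theta - W) @ C                      # gradient of the objective
        theta = hard_threshold_rows(theta - eta * grad, k)  # projection step
    return theta

def relative_error(W, theta, C):
    """||(W - Theta) C^{1/2}||_F / ||W||_F, computed without forming C^{1/2}."""
    D = W - theta
    # ||D C^{1/2}||_F^2 equals trace(D C D^T) for symmetric PSD C.
    return np.sqrt(max(np.sum((D @ C) * D), 0.0)) / np.linalg.norm(W)
```

A possible usage pattern is to build `C` from calibration activations `X` as `X @ X.T / X.shape[1]`, call `awp_prune_sketch(W, C, k)`, and track `relative_error` across iterations. Replacing the projection with rounding to a quantization grid would give the analogous quantization variant, though the details of that projection are not specified in this excerpt.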

Paper Structure

This paper contains 13 sections, 3 theorems, 24 equations, 1 figure, 5 tables, 1 algorithm.

Key Result

Theorem 1.1

[blumensath2009iterative] Consider a noisy observation $\mathbf{y} = \mathbf{A} \boldsymbol\theta_k + \mathbf{e}$, where $\boldsymbol\theta_k$ is $k$-sparse. If $\mathbf{A}$ has the restricted isometry property with $\beta_{3k}<1/8$, then, at iteration $t$, IHT will recover an approximation $\boldsymbol\theta$ […] Furthermore, after at most $t'= \lceil \log_2(\|\boldsymbol\theta_k\|_2 / \|\mathbf{e}\|_2)\rceil$ iterations […]
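To get a feel for the iteration count $t'$ in the statement above, take a hypothetical signal-to-noise ratio $\|\boldsymbol\theta_k\|_2 / \|\mathbf{e}\|_2 = 100$ (an illustrative value, not one from the paper):

$$t' = \lceil \log_2 100 \rceil = \lceil 6.64\ldots \rceil = 7.$$

Because the count grows only logarithmically in this ratio, each doubling of the ratio adds at most one iteration.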

Figures (1)

  • Figure 1: $\| \mathbf{W}\mathbf{C}^{\frac{1}{2}} -\mathbf{\Theta}^{(t)}\mathbf{C}^{\frac{1}{2}}\|_\mathrm{F}/\| \mathbf{W}\|_\mathrm{F}$, w.r.t. iteration $t$ during AWP pruning of a layer in the Llama-2 7B model.
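The numerator of the quantity plotted in Figure 1 can be evaluated without computing a matrix square root. Assuming $\mathbf{C}$ is symmetric positive semidefinite (for example an activation second-moment matrix, which is an assumption not stated in this excerpt) and $\mathbf{C}^{\frac{1}{2}}$ is its symmetric square root:

$$\| \mathbf{W}\mathbf{C}^{\frac{1}{2}} - \mathbf{\Theta}^{(t)}\mathbf{C}^{\frac{1}{2}} \|_\mathrm{F}^2 = \mathrm{tr}\left( (\mathbf{W} - \mathbf{\Theta}^{(t)})\, \mathbf{C}\, (\mathbf{W} - \mathbf{\Theta}^{(t)})^\top \right).$$

Under the same assumption, the gradient of half this squared norm with respect to $\mathbf{\Theta}^{(t)}$ is $(\mathbf{\Theta}^{(t)} - \mathbf{W})\mathbf{C}$, which is the only place $\mathbf{C}$ enters a gradient step.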

Theorems & Definitions (7)

  • Theorem 1.1
  • Theorem 1.2
  • Corollary 1.3
  • proof
  • Definition 1.4
  • Definition 1.5
  • Remark 1.6