
Fast and Effective Weight Update for Pruned Large Language Models

Vladimír Boža

TL;DR

The paper tackles the challenge of pruning large language models without expensive fine-tuning by introducing a fast, ADMM-based weight-update mechanism for pruned layers, combined with mask preconditioning and a gradual pruning schedule. The core objective is to minimize the layer-wise reconstruction error $\|XW - X (M \odot \widehat{W})\|_2^2$, where $X$ is the layer input, $W$ the original weights, $\widehat{W}$ the updated weights, and $M$ the binary pruning mask. ADMM solves this efficiently: it needs only one matrix inversion (involving $X^T X$) and a few iterations. Empirically, the method achieves state-of-the-art pruning performance across LLaMA-7B and multiple LLaMA-2 variants, often outperforming prior approaches such as Wanda and SparseGPT while keeping overhead low. This makes deploying pruned LLMs more practical, since it reduces memory bandwidth and compute requirements without large-scale fine-tuning.
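To make the masked reconstruction objective concrete, here is a minimal NumPy sketch of a standard ADMM iteration for a least-squares problem with a fixed sparsity mask. It is an illustration under stated assumptions, not the authors' implementation: the function names, the penalty scaling `rel_rho`, the iteration count, and the magnitude-based mask in the demo (a stand-in for a Wanda mask) are all hypothetical choices made for this example.

```python
import numpy as np


def reconstruction_error(X, W, W_pruned):
    """Squared error ||X W - X W_pruned||^2 between dense and pruned outputs."""
    return float(np.linalg.norm(X @ W - X @ W_pruned) ** 2)


def admm_masked_update(X, W, M, rel_rho=1.0, iters=50):
    """Approximately minimize ||X W - X W_hat||^2 subject to W_hat being zero
    wherever the 0/1 mask M is zero.

    rel_rho is an illustrative assumption: the ADMM penalty is set to
    rel_rho times the mean diagonal entry of X^T X.
    """
    XtX = X.T @ X
    XtXW = XtX @ W
    rho = rel_rho * float(np.mean(np.diag(XtX)))
    # The only matrix inversion; it is reused in every iteration.
    inv = np.linalg.inv(XtX + rho * np.eye(XtX.shape[0]))

    Z = W * M              # masked copy of the weights (always feasible)
    U = np.zeros_like(W)   # scaled dual variable
    for _ in range(iters):
        # Least-squares step pulled toward the current masked iterate.
        W_hat = inv @ (XtXW + rho * (Z - U))
        # Projection onto the mask constraint.
        Z = M * (W_hat + U)
        # Dual update on the remaining disagreement between W_hat and Z.
        U = U + W_hat - Z
    return Z


# Small synthetic demo: random calibration inputs and weights.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((64, 16))   # (samples, in_features)
W_demo = rng.standard_normal((16, 8))    # (in_features, out_features)
# 50% magnitude mask, used here only as a stand-in for a Wanda mask.
M_demo = (np.abs(W_demo) >= np.median(np.abs(W_demo))).astype(W_demo.dtype)

W_updated = admm_masked_update(X_demo, W_demo, M_demo, iters=200)
print("error, mask only:   ", reconstruction_error(X_demo, W_demo, W_demo * M_demo))
print("error, ADMM update: ", reconstruction_error(X_demo, W_demo, W_updated))
```

Returning the projected iterate `Z` rather than `W_hat` guarantees that the result satisfies the mask exactly, even if the loop stops before the two iterates agree.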

Abstract

Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.

Paper Structure

This paper contains 17 sections, 2 theorems, 12 equations, 2 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let Assumptions 1 and 2 hold. Then:

Figures (2)

  • Figure 1: Reconstruction error over time (in seconds) during optimization of weights in selected layers of LLaMA-7B. The mask was derived by Wanda using 50% sparsity. We compare our proposed ADMM algorithm to SGD with momentum and Adam using various learning rates. We also compare to the SparseGPT update. Our ADMM update converges much faster than other methods and is better than the SparseGPT update.
  • Figure 2: WikiText perplexity vs time overhead for ADMM, Adam, and SparseGPT weight update on LLaMA-7B. We run ADMM and Adam for 1, 10, 20, 50 and 100 update steps and test Adam with various learning rates. The top plot shows 60% sparsity. The bottom one uses 80% sparsity. SparseGPT full refers to normal SparseGPT, which also selects the pruning mask gradually. All other options just update weights over a fixed mask selected by Wanda. Our weight update is better than the one in SparseGPT and better than gradient-based methods.

Theorems & Definitions (3)

  • Theorem 1
  • Corollary 1
  • proof