Fast and Effective Weight Update for Pruned Large Language Models
Vladimír Boža
TL;DR
The paper tackles the challenge of pruning large language models without expensive fine-tuning by introducing a fast, ADMM-based weight-update mechanism for pruned layers, combined with mask preconditioning and a gradual pruning schedule. The core objective is to minimize the layer-wise reconstruction error $\|XW - X (M \odot \widehat{W})\|_2^2$ while enforcing the pruning mask $M$; ADMM solves this efficiently with a single matrix inversion involving $X^T X$ and a few iterations. Empirically, the method achieves state-of-the-art pruning performance on LLaMA-7B and multiple LLaMA-2 variants, often outperforming prior approaches such as Wanda and SparseGPT while keeping overhead low. This makes deploying pruned LLMs more practical, since it reduces memory bandwidth and compute requirements without large-scale fine-tuning.
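To make the objective above concrete, the following is a minimal NumPy sketch of an ADMM-style weight update for a fixed mask. It is an illustration under stated assumptions, not the authors' implementation: the function names, the default choice of the penalty `rho` (the mean diagonal entry of $X^T X$), and the iteration count are all hypothetical, and mask preconditioning is omitted. Here `X` has shape `(n_samples, d_in)`, `W` has shape `(d_in, d_out)`, and `M` is a 0/1 array with the same shape as `W`.

```python
import numpy as np

def admm_masked_update(X, W, M, rho=None, iters=20):
    """Approximately minimize ||X W - X W_hat||^2 subject to W_hat being zero where M == 0.

    Illustrative sketch only; names and defaults are assumptions, not the paper's code.
    """
    XtX = X.T @ X
    d = XtX.shape[0]
    if rho is None:
        # Assumed heuristic: scale the penalty to the average diagonal of X^T X.
        rho = np.trace(XtX) / d
    XtXW = XtX @ W
    # The only matrix inversion; it is reused in every iteration.
    inv = np.linalg.inv(XtX + rho * np.eye(d))
    Z = M * W              # masked copy of the weights
    U = np.zeros_like(W)   # scaled dual variable
    for _ in range(iters):
        W_hat = inv @ (XtXW + rho * (Z - U))  # unconstrained least-squares step
        Z = M * (W_hat + U)                   # projection onto the mask
        U = U + W_hat - Z                     # dual update
    return Z  # exactly zero wherever M == 0

def reconstruction_error(X, W, W_pruned):
    """Squared reconstruction error ||X W - X W_pruned||^2."""
    return float(np.linalg.norm(X @ W - X @ W_pruned) ** 2)
```

Because simply zeroing the masked weights (`M * W`) is itself a feasible point of this constrained problem, a converged update should never reconstruct the layer output worse than that baseline, which is a convenient sanity check when experimenting with `rho` and `iters`.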
Abstract
Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.
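As a rough illustration of what gradual mask selection can look like, the sketch below raises the target sparsity over several steps and reselects the mask at each step. The abstract does not specify the schedule or the selection criterion, so the cubic schedule and the plain magnitude criterion used here are assumptions chosen for illustration, and both function names are hypothetical.

```python
import numpy as np

def cubic_sparsity(step, total_steps, final_sparsity):
    """Assumed cubic schedule: rises from 0 to final_sparsity, fastest at the start."""
    frac = step / total_steps
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

def magnitude_mask(W, sparsity):
    """0/1 mask that drops roughly the `sparsity` fraction of smallest-magnitude weights.

    Ties at the threshold are all dropped, so slightly more weights may be pruned.
    """
    k = int(round(sparsity * W.size))
    if k == 0:
        return np.ones_like(W)
    # k-th smallest absolute value; everything at or below it is pruned.
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return (np.abs(W) > threshold).astype(W.dtype)
```

In a gradual scheme, each step would compute `cubic_sparsity(step, total_steps, final_sparsity)`, select a mask at that sparsity from the current weights, and then run a weight update under the new mask before moving to the next step.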
