Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, Zhiqiang Shen

TL;DR

This work tackles the challenge of efficiently pruning billion-parameter LLMs without retraining. It introduces GBLM-Pruner, a gradient-based, training-free pruning method motivated by a Taylor-expansion analysis in the Optimal Brain Surgeon (OBS) framework that retains the first-order gradient term. Gradients and activations collected from a few calibration samples score each weight via $W_m[i,j] = |W[i,j]|\|X[:,j]\|_2 + \alpha|W[i,j]|\|G[:,i,j]\|_p$, where $\alpha$ scales the gradient term. The approach outperforms magnitude pruning, SparseGPT, and Wanda on LLaMA-1 and LLaMA-2 in perplexity and zero-shot tasks, and is supported by ablations and a theoretical analysis. Incorporating gradients also leads unstructured pruning to reveal structural patterns that reflect the geometric interdependence of LLM parameters, and the method extends to Vision Transformers.
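The scoring formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `gblm_score`, the array shapes, and the default `alpha=1.0` are assumptions, and the gradient normalization mentioned in the abstract is not modeled here.

```python
import numpy as np

def gblm_score(W, X, G, alpha=1.0, p=2):
    """Hypothetical helper computing the GBLM-Pruner importance score.

    Assumed shapes (not stated in the source):
      W: (out_features, in_features) weight matrix
      X: (tokens, in_features) calibration activations
      G: (samples, out_features, in_features) per-sample gradients
    """
    abs_w = np.abs(W)
    # ||X[:, j]||_2: one l2 norm per input feature j
    act_norm = np.linalg.norm(X, ord=2, axis=0)
    # ||G[:, i, j]||_p: norm across calibration samples for each weight
    if p == 1:
        grad_norm = np.abs(G).sum(axis=0)
    elif p == 2:
        grad_norm = np.sqrt((G ** 2).sum(axis=0))
    else:
        raise ValueError("p must be 1 or 2")
    # W_m = |W| * ||X||_2 + alpha * |W| * ||G||_p
    return abs_w * act_norm[None, :] + alpha * abs_w * grad_norm

# Toy inputs chosen only to make the arithmetic easy to follow
W = np.array([[1.0, -2.0], [3.0, 0.0]])
X = np.array([[3.0, 0.0], [4.0, 1.0]])
G = np.array([[[3.0, 0.0], [0.0, 1.0]],
              [[4.0, 0.0], [0.0, 1.0]]])
scores = gblm_score(W, X, G, alpha=1.0, p=2)
```

Setting `alpha=0` drops the gradient term, which leaves the weight-times-activation product that the figure caption describes as the optional activation component.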

Abstract

Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, which removes some model weights without hurting performance. Prior approaches such as magnitude pruning, SparseGPT, and Wanda either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained LLMs. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the pruning metric, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, by incorporating gradients, unstructured pruning with our method tends to reveal some structural patterns, which mirror the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates, which keeps it as simple as its counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various benchmarks show that GBLM-Pruner surpasses magnitude pruning, Wanda, and SparseGPT by significant margins. We further extend our approach to Vision Transformers. Our code and models are available at https://github.com/VILA-Lab/GBLM-Pruner.

Paper Structure

This paper contains 23 sections, 20 equations, 5 figures, 11 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Illustration of our method GBLM-Pruner. Given a weight matrix $\mathbf W$, a gradient matrix $\mathbf G$, and an input feature activation $\mathbf X$, weight importance is computed as the element-wise product of the weight magnitude and the $\ell_1$ or $\ell_2$ norm of the gradients across multiple samples, denoted $\|\mathbf G\|_p \cdot |\mathbf W|$. Optionally, the product of the weight magnitude and the $\ell_2$ norm of the input activations, denoted $|\mathbf W| \cdot \|\mathbf X\|_2$, can be added.
  • Figure 2: Results under varying sparsity for a large and a small model, comparing our method against baseline methods.
  • Figure 3: Robustness to calibration samples.
  • Figure 4: Illustration of learned pruning pattern.
  • Figure : The GBLM-Pruner algorithm
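
Once importance scores are available, unstructured pruning amounts to zeroing the lowest-scoring weights with no retraining or weight update. The sketch below shows one way to do this. It assumes scores are ranked within each output row, which is a choice made for illustration since the text here does not specify the comparison group, and `prune_rows` is a hypothetical helper name.

```python
import numpy as np

def prune_rows(W, scores, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights in every row.

    Per-row ranking is an assumption of this sketch, not a detail
    taken from the source.
    """
    n_prune = int(W.shape[1] * sparsity)
    # Column indices of the n_prune smallest scores in each row
    idx = np.argsort(scores, axis=1, kind="stable")[:, :n_prune]
    mask = np.ones(W.shape, dtype=bool)
    np.put_along_axis(mask, idx, False, axis=1)
    # Surviving weights are left unchanged (no weight update)
    return W * mask, mask

# Toy weights and scores, chosen only for illustration
W = np.array([[1.0, -2.0, 3.0, 4.0],
              [5.0, 6.0, -7.0, 8.0]])
scores = np.array([[0.1, 0.9, 0.5, 0.2],
                   [4.0, 3.0, 2.0, 1.0]])
W_pruned, mask = prune_rows(W, scores, sparsity=0.5)
```

Because the mask is chosen weight by weight, any row or column structure that appears in it comes from the scores themselves rather than from a constraint imposed by the pruning step.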