AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang

TL;DR

This paper uses Heavy-Tailed Self-Regularization (HT-SR) Theory, in particular the shape of empirical spectral densities of weight matrices, to design improved layerwise pruning ratios for LLMs, and proposes AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.

Abstract

Recent work on pruning large language models (LLMs) has shown that one can eliminate a large number of parameters without compromising performance, making pruning a promising strategy to reduce LLM model size. Existing LLM pruning strategies typically assign uniform pruning ratios across layers, limiting overall pruning ability; and recent work on layerwise pruning of LLMs is often based on heuristics that can easily lead to suboptimal performance. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in particular the shape of empirical spectral densities (ESDs) of weight matrices, to design improved layerwise pruning ratios for LLMs. Our analysis reveals a wide variability in how well-trained, and thus relatedly how prunable, different layers of an LLM are. Based on this, we propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. AlphaPruning can be used in conjunction with multiple existing LLM pruning methods. Our empirical results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs. We have open-sourced our code at https://github.com/haiquanlu/AlphaPruning.
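
To make the "shape of the ESD" idea concrete, the following is a minimal sketch of how a heavy-tail exponent could be estimated for one weight matrix. It is not taken from the released code: the function names `esd` and `hill_alpha`, the use of NumPy, and the default choice of fitting the upper half of the spectrum are all assumptions made for illustration. The estimator used here is the standard Hill estimator, which is what the metric name PL_Alpha_Hill suggests.

```python
import numpy as np

def esd(weight):
    """Empirical spectral density support: eigenvalues of W^T W, ascending.

    These equal the squared singular values of W.
    """
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return np.sort(singular_values ** 2)

def hill_alpha(eigs, k=None):
    """Hill estimate of the power-law exponent alpha, p(x) ~ x^(-alpha).

    Uses the k largest values. The default k (upper half of the spectrum)
    is an illustrative assumption, not a setting confirmed by the paper.
    Assumes strictly positive eigenvalues.
    """
    eigs = np.sort(np.asarray(eigs, dtype=float))
    n = eigs.size
    if n < 2:
        raise ValueError("need at least two eigenvalues")
    if k is None:
        k = n // 2
    k = max(1, min(k, n - 1))
    tail = eigs[n - k:]          # the k largest eigenvalues
    threshold = eigs[n - k - 1]  # the value just below the tail
    return 1.0 + k / np.sum(np.log(tail / threshold))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weight = rng.standard_normal((256, 128))  # stand-in for a layer weight
    print("alpha estimate:", hill_alpha(esd(weight)))
```

Under HT-SR theory, a smaller estimate indicates a heavier-tailed spectrum. Applying `hill_alpha(esd(W))` to every weight matrix would give the per-layer metric values that a layerwise allocation scheme can consume.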

Paper Structure

This paper contains 56 sections, 4 equations, 15 figures, 25 tables.

Figures (15)

  • Figure 1: The pipeline diagram of AlphaPruning. Our post-training layer-wise pruning method involves the following steps: (i) performing ESD analysis on all weight matrices of a base LLM; (ii) employing PL fitting to derive the layer-wise metric values (which measure the HT exponent); and (iii) using these metric values to assign a pruning ratio to each layer through a linear assignment function.
  • Figure 2: ImageNet-1K accuracy ($\uparrow$) of the sparse ConvNext model pruned to various sparsity levels by AlphaPruning and other baseline methods, without fine-tuning.
  • Figure 3: Comparing layerwise sparsities of AlphaPruning and uniform sparsities, at 80% global sparsity on LLaMA-7B. The curves represent the layerwise sparsities, which are determined by PL_Alpha_Hill values shown by the histograms.
  • Figure 4: Analyzing the heavy-tail metric PL_Alpha_Hill (lower is better according to HT-SR theory) and the performance metric WikiText validation perplexity (lower is better) before and after pruning with baseline uniform pruning and with AlphaPruning. (a) The metric value is reported by averaging over all layers within each model. The dashed lines represent the perplexity and the histograms represent the PL_Alpha_Hill value. (b) The metric is reported by averaging over all the matrices within each LLM layer.
  • Figure 5: Analyzing ESD properties and assignment strategies for LRA. (a) Stable_Rank and PL_Alpha_Hill show a similar pattern (the more heavy-tailed, the more low-ranked) across different ESDs sampled from a Pareto distribution. (b) The layer-wise PL_Alpha_Hill and Stable_Rank of the LLaMA-7B model exhibit a similar trend. (c) Comparing two assignment strategies for LRA: "Compress more on HTed layers" performs better. This is the opposite of pruning-based methods, for which "Compress less on HTed layers" performs better.
  • ...and 10 more figures
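
Step (iii) of the Figure 1 pipeline, the linear assignment from metric values to layerwise sparsities, can be sketched as follows. This is an illustrative variant and not necessarily the paper's exact function: the name `assign_sparsities`, the `spread` parameter, and the additive centering used to hit the global target are assumptions. It follows the direction stated in the captions for pruning: layers with a lower PL_Alpha_Hill (more heavy-tailed) are pruned less.

```python
import numpy as np

def assign_sparsities(alphas, sizes, target, spread=0.2):
    """Map per-layer alpha values linearly to per-layer sparsity ratios.

    alphas : per-layer heavy-tail metric (lower = more heavy-tailed)
    sizes  : per-layer parameter counts, used as weights
    target : desired global sparsity (parameter-weighted mean)
    spread : gap between the largest and smallest layer sparsity
             (hypothetical knob introduced for this sketch)
    """
    alphas = np.asarray(alphas, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    span = alphas.max() - alphas.min()
    if span == 0.0:
        # All layers look alike: fall back to uniform sparsity.
        return np.full(alphas.shape, float(target))
    # Rescale alpha to [0, 1]; higher alpha means more prunable.
    normalized = (alphas - alphas.min()) / span
    # Center so the parameter-weighted mean sparsity equals the target.
    centered = normalized - np.average(normalized, weights=sizes)
    sparsities = target + spread * centered
    if sparsities.min() < 0.0 or sparsities.max() > 1.0:
        raise ValueError("spread too large for this target sparsity")
    return sparsities

if __name__ == "__main__":
    alphas = [2.5, 3.1, 4.0, 3.6]   # made-up metric values
    sizes = [10, 20, 30, 40]        # made-up parameter counts
    print(assign_sparsities(alphas, sizes, target=0.7))
```

The resulting ratios can then be handed to any per-layer pruning method, which matches the abstract's point that the allocation is independent of the pruning criterion. Raising an error instead of clipping is a deliberate choice here, since clipping would silently change the global sparsity.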