ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

Mingluo Su, Huan Wang

TL;DR

This paper proposes ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier, and introduces the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model.

Abstract

Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo-su/ROSE.
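The abstract names the relative range of block loss as the signal for deciding which layers are columnar, but this excerpt does not give its formula or a threshold. The sketch below is therefore only an illustration under stated assumptions: it uses one common definition of relative range, (max - min) / mean, and treats a larger value as evidence of a columnar layer. The function names `relative_range` and `is_columnar` and the `threshold` default are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def relative_range(block_loss):
    """Spread of per-block pruning loss, normalized by its mean.

    ASSUMPTION: (max - min) / mean is used as the definition here;
    the paper's exact formula is not stated in this excerpt.
    """
    block_loss = np.asarray(block_loss, dtype=float)
    mean = block_loss.mean()
    if mean == 0.0:
        # All-zero losses carry no spread information.
        return 0.0
    return float((block_loss.max() - block_loss.min()) / mean)


def is_columnar(block_loss, threshold=1.0):
    """Flag a layer as columnar when block losses are highly uneven.

    ASSUMPTION: both the direction of the comparison and the threshold
    value are illustrative placeholders.
    """
    return relative_range(block_loss) > threshold
```

With a decision rule of this kind, reordering would be applied only to layers whose block losses are strongly uneven, while the remaining layers keep the default left-to-right order.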

Paper Structure (25 sections, 14 equations, 13 figures, 11 tables, 1 algorithm)

Figures (13)

  • Figure 1: (a) Change in reconstruction error of the "self_attn.o_proj" layer in the first Transformer Block of LLaMA2-7B during SparseGPT pruning as the number of pruned blocks increases. The sharpest increase in reconstruction error appears at a late stage. (b) Weight visualization of the corresponding layer. It exhibits a columnar pattern along the input channel, and one block contains the most concentrated high-magnitude weights, as illustrated. (c) Reconstruction error after reordering the original block with the highest pruning error. The earlier that block is pruned, the smaller the reconstruction error.
  • Figure 2: The distribution of relative change of weights before and after pruning. The majority of weights remain relatively stable.
  • Figure 3: (a) Overview of the difference between SparseGPT and ROSE. Orange represents weight importance: the darker the color, the greater the importance. In SparseGPT, the number of weights available for error compensation (shown in dark blue) decreases during pruning, which limits recovery if high-error weights are pruned late. ROSE moves weights with potentially large pruning errors to the front so that they are pruned earlier. In this way, more parameters remain available to compensate for the larger errors. (b) Illustration of ROSE for a columnar layer. Given the dense weight $\mathbf{W}$ and target sparsity rate $p\%$, we calculate the importance score $\mathbf{S}$ and split it into blocks based on $B_S$. The smallest $p\%$ of values in each block are selected to form the loss matrix $\mathbf{L}$. Column loss and block loss are calculated from the loss matrix. Columns within each block are reordered in descending order of column loss, and blocks are reordered in descending order of block loss.
  • Figure 4: Relative reconstruction error of the "self_attn.o_proj" layer in the second Transformer Block of LLaMA2-7B by ROSE and its variants at varying sparsity rates.
  • Figure 5: Ablation study of block size, number of calibration samples, and calibration sequence length on LLaMA2-7B at a 70% sparsity rate.
  • ...and 8 more figures
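
The two-level reordering described in the Figure 3(b) caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the importance scores $\mathbf{S}$ are already computed and passed in as `scores`, that column loss is the sum of a column's entries in the loss matrix, and that block loss is the sum of its column losses. The function name `rose_column_order` and its parameters are hypothetical.

```python
import numpy as np


def rose_column_order(scores, block_size, sparsity):
    """Return a column permutation: high-loss blocks first, and within
    each block, high-loss columns first.

    scores     : 2-D array of importance scores (rows x input columns)
    block_size : number of columns per block (B_S in the caption)
    sparsity   : fraction of weights pre-pruned in each block (p / 100)
    """
    scores = np.asarray(scores, dtype=float)
    n_cols = scores.shape[1]
    block_orders, block_losses = [], []

    for start in range(0, n_cols, block_size):
        block = scores[:, start:start + block_size]
        flat = block.ravel()                 # flattened copy of the block
        k = int(flat.size * sparsity)        # weights pre-pruned in this block

        # Loss matrix: keep the k smallest scores, zero everywhere else.
        loss = np.zeros(block.shape)
        if k > 0:
            idx = np.argpartition(flat, k - 1)[:k]
            loss.flat[idx] = flat[idx]

        # ASSUMPTION: column loss = column sum, block loss = total sum.
        col_loss = loss.sum(axis=0)
        order = np.argsort(-col_loss, kind="stable")   # descending
        block_orders.append(start + order)
        block_losses.append(col_loss.sum())

    block_rank = np.argsort(-np.asarray(block_losses), kind="stable")
    return np.concatenate([block_orders[b] for b in block_rank])


example_scores = np.array([[1.0, 5.0, 9.0, 2.0],
                           [3.0, 4.0, 8.0, 7.0]])
perm = rose_column_order(example_scores, block_size=2, sparsity=0.5)
```

In this toy example the second block has the larger block loss, so its columns come first. In an actual pruning pass, the same permutation would also have to be applied to the rows and columns of the Hessian, and inverted on the pruned weights afterward so that the layer computes the same function.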