ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

Mingluo Su; Huan Wang

TL;DR

This paper proposes ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier, and introduces the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model.

Abstract

Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo-su/ROSE.
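
The two-level reordering described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the importance score `S` is taken as a given non-negative matrix, column and block losses are assumed to be plain sums over the pre-pruned entries, and `relative_range` uses `(max - min) / mean`, which is only one plausible reading because this summary does not give the exact formula. All function names are hypothetical.

```python
import numpy as np

def rose_style_permutation(S, sparsity, block_size):
    """Sketch: pre-prune each block, then order blocks and columns by loss.

    S          : (rows, cols) float array of importance scores (assumed given).
    sparsity   : fraction of entries pre-pruned inside each block, e.g. 0.5.
    block_size : number of columns per block.
    Returns (perm, col_loss, block_loss), where perm lists original column
    indices in the new pruning order.
    """
    rows, cols = S.shape
    starts = list(range(0, cols, block_size))
    col_loss = np.zeros(cols)

    for s in starts:
        block = S[:, s:s + block_size]
        flat = block.ravel()
        k = int(flat.size * sparsity)  # entries pre-pruned in this block
        if k == 0:
            continue
        pruned = np.argpartition(flat, k - 1)[:k]  # indices of the k smallest scores
        loss = np.zeros_like(flat)
        loss[pruned] = flat[pruned]  # loss matrix: score where pruned, else 0
        # Assumption: column loss is the sum of the loss matrix over rows.
        col_loss[s:s + block_size] = loss.reshape(block.shape).sum(axis=0)

    # Assumption: block loss is the sum of its column losses.
    block_loss = np.array([col_loss[s:s + block_size].sum() for s in starts])

    perm = []
    for b in np.argsort(-block_loss, kind="stable"):  # blocks, descending loss
        idx = np.arange(starts[b], min(starts[b] + block_size, cols))
        order = np.argsort(-col_loss[idx], kind="stable")  # columns, descending loss
        perm.extend(idx[order])
    return np.array(perm), col_loss, block_loss

def relative_range(block_loss):
    """Hypothetical definition: spread of block losses relative to their mean."""
    return (block_loss.max() - block_loss.min()) / block_loss.mean()

S = np.array([[1.0, 5.0, 9.0, 2.0],
              [4.0, 3.0, 8.0, 7.0]])
perm, col_loss, block_loss = rose_style_permutation(S, sparsity=0.5, block_size=2)
```

In a SparseGPT-style pipeline, such a permutation would presumably be applied to the weight columns and to the matching rows and columns of the Hessian before pruning, and undone afterwards. A layer would be treated as columnar, and therefore reordered, only when its relative range exceeds some threshold; that threshold is not stated in this summary.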

Paper Structure (25 sections, 14 equations, 13 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Network Pruning
Unstructured Pruning for LLMs
Prerequisites
Methodology
Analyses
Proposed ROSE
Experimental Results
Experiment Settings
Reconstruction Error Analyses
Main Benchmark Results
Ablation Study
Running Consumption Analyses
Conclusion
...and 10 more sections

Figures (13)

Figure 1: (a) Change in reconstruction error of the "self_attn.o_proj" layer in the first Transformer Block of LLaMA2-7B during SparseGPT pruning as the number of pruned blocks increases. The sharpest increase in reconstruction error appears at a later stage. (b) Weight visualization of the corresponding layer. It exhibits a columnar pattern along the input channel, and one block contains the most concentrated high-magnitude weights, as illustrated. (c) Reconstruction error after moving the original block with the highest pruning error to different positions in the pruning order. The earlier the original block is pruned, the smaller the reconstruction error.
Figure 2: The distribution of relative change of weights before and after pruning. The majority of weights remain relatively stable.
Figure 3: (a) Overview of the difference between SparseGPT and ROSE. Orange represents weight importance: the darker the color, the greater the importance. In SparseGPT, the number of weights available for error compensation (shown in dark blue) decreases during pruning, limiting recovery if high-error weights are pruned late. ROSE moves weights with potentially large pruning errors to the front so that they are pruned earlier. In this way, more parameters remain available to compensate for the larger errors. (b) Illustration of ROSE for a columnar layer. Given the dense weight $\mathbf{W}$ and target sparsity rate $p\%$, we calculate the importance score $\mathbf{S}$ and split it into blocks based on $B_S$. The smallest $p\%$ of values from each block are selected as the loss matrix $\mathbf{L}$. Column loss and block loss are calculated based on the loss matrix. Columns within one block are reordered in descending order of column loss, and blocks are reordered in descending order of block loss.
Figure 4: Relative reconstruction error of the "self_attn.o_proj" layer in the second Transformer Block of LLaMA2-7B by ROSE and its variants at varying sparsity rates.
Figure 5: Ablation study of block size, number of calibration samples, and calibration sequence length in LLaMA2-7B at a 70% sparsity rate.
...and 8 more figures
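
The intuition in Figure 3(a), that fewer weights remain available for error compensation as pruning moves left to right, can be made concrete with a small count. The sketch below is an illustration only: it works at block granularity, assumes each pruned block can only push its error onto columns in later blocks, and uses a hypothetical helper name.

```python
import math

def columns_left_for_compensation(n_cols, block_size):
    """For each block in pruning order, count the columns in later blocks.

    Those later, still-unpruned columns are the ones that can absorb the
    error introduced by pruning the current block (block-level view only).
    """
    n_blocks = math.ceil(n_cols / block_size)
    return [max(n_cols - (i + 1) * block_size, 0) for i in range(n_blocks)]

remaining = columns_left_for_compensation(8, 2)
```

For 8 columns in blocks of 2, the counts shrink from 6 to 0, so a block pruned last has nothing left to compensate its error. This is why moving high-error blocks to the front can help.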
