
What Matters In The Structured Pruning of Generative Language Models?

Michael Santacroce, Zixin Wen, Yelong Shen, Yuanzhi Li

TL;DR

Decoder-only large language models incur high computational costs, motivating a systematic study of structured pruning methods for natural language generation. The authors demonstrate that random neuron pruning often rivals traditional movement and magnitude pruning, and they propose a redundancy-based analysis using sensitivity and uniqueness to diagnose pruning behavior. They introduce Globally Unique Movement (GUM), which combines global Top_v pruning with a cosine-similarity regularizer to increase neuron uniqueness while maintaining performance, with strong results across Wikitext-103, WikiSQL, and SAMsum. The work further shows that distillation can significantly narrow method gaps, offering practical guidance for pruning decoder-only LLMs while balancing compression and quality.
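To make the "global" part of global Top_v concrete, the sketch below pools per-neuron importance scores from all layers into one ranking and keeps a fixed fraction overall, so individual layers can end up with different leftover ratios. It is a minimal illustration, not the authors' implementation: the function names `global_top_v` and `leftover_per_layer`, the plain NumPy score arrays, and the tie-breaking rule are all assumptions made here.

```python
import numpy as np

def global_top_v(scores_per_layer, keep_fraction):
    """Keep the highest-scoring neurons ranked across ALL layers at once.

    scores_per_layer: list of 1-D arrays, one importance score per neuron
                      (hypothetical input; how scores are learned is not shown).
    keep_fraction:    fraction of all neurons to keep, e.g. 0.25.
    Returns a list of boolean masks, one per layer (True = neuron kept).
    """
    flat = np.concatenate(scores_per_layer)
    k = int(round(keep_fraction * flat.size))
    keep_flat = np.zeros(flat.size, dtype=bool)
    if k > 0:
        # Stable sort on negated scores: ties go to the earlier neuron.
        order = np.argsort(-flat, kind="stable")
        keep_flat[order[:k]] = True
    # Split the pooled mask back into per-layer masks.
    boundaries = np.cumsum([s.size for s in scores_per_layer])[:-1]
    return np.split(keep_flat, boundaries)

def leftover_per_layer(masks):
    """Fraction of neurons kept in each layer."""
    return [float(m.mean()) for m in masks]

# Toy example with two layers of four neurons each.
scores = [np.array([0.9, 0.8, 0.7, 0.1]), np.array([0.6, 0.05, 0.02, 0.01])]
masks = global_top_v(scores, keep_fraction=0.5)
ratios = leftover_per_layer(masks)
```

In this toy case half of all neurons are kept in total, but the first layer keeps three of four and the second keeps one of four, which is the kind of uneven per-layer allocation a layer-wise (local) selection could not produce.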

Abstract

Auto-regressive large language models such as GPT-3 require enormous computational resources to use. Traditionally, structured pruning methods are employed to reduce resource usage. However, their application to and efficacy for generative language models are heavily under-explored. In this paper we conduct a comprehensive evaluation of common structured pruning methods, including magnitude, random, and movement pruning on the feed-forward layers in GPT-type models. Unexpectedly, random pruning results in performance that is comparable to the best established methods, across multiple natural language generation tasks. To understand these results, we provide a framework for measuring neuron-level redundancy of models pruned by different methods, and discover that established structured pruning methods do not take into account the distinctiveness of neurons, leaving behind excess redundancies. In view of this, we introduce Globally Unique Movement (GUM) to improve the uniqueness of neurons in pruned models. We then discuss the effects of our techniques on different redundancy metrics to explain the improved performance.

Paper Structure (35 sections, 5 equations, 6 figures, 18 tables, 1 algorithm)

Figures (6)

  • Figure 1: Sensitivity and Uniqueness measured on the training set for GPT-Neo-125m. The vertical axis is defined as the ratio of the corresponding metric between the pruned model and a baseline model (which is non-pruned and fully fine-tuned) with a maximum of 1x. We are able to use these graphs to analyze and compare the performance of different pruning methods. Details of measurements are given in Appendix \ref{apx:graphs}.
  • Figure 2: Sensitivity and Uniqueness measured on the training set for GPT-2-sm. The vertical axis is defined as the ratio of the corresponding metric between the pruned model and a baseline model (which is non-pruned and fully fine-tuned) with a maximum of 1x. Details of measurements are given in Appendix \ref{apx:graphs}.
  • Figure 3: For each layer, this graph shows the percentage of neurons with at least one similarity in each range. Similarity is defined as the absolute value of cosine similarity over the entire validation dataset, with ranges increasing from 0 to 1. $\text{Top}_v$ and GUM are compared, training on WikiSQL with GPT-Neo-125m. The leftover neurons total exactly 25% of all neurons.
  • Figure 4: The percentage of neurons left over in layers 1-12 after Global $\text{Top}_v$ pruning, using GUM training on WikiSQL with GPT-Neo-125m. The leftover neurons total exactly 25% of all neurons.
  • Figure 5: Sensitivity and Uniqueness measured without re-scaling, for GPT-2-sm.
  • ...and 1 more figures
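
The quantity plotted in Figure 3 can be sketched roughly as follows: compute the absolute cosine similarity between every pair of neurons in a layer, then report, for each similarity range, the percentage of neurons that have at least one other neuron in that range. This is only an illustration under stated assumptions. The function name `percent_with_similarity_in_range`, the activation-matrix input, and the bin edges are hypothetical, and the paper's exact measurement procedure is given in its appendix, which is not reproduced here.

```python
import numpy as np

def percent_with_similarity_in_range(activations, edges):
    """Percentage of neurons having at least one |cosine similarity| per range.

    activations: array of shape (num_samples, num_neurons), one column per
                 neuron (assumed input format, collected over a dataset).
    edges:       increasing bin edges from 0 to 1, e.g. [0, 0.5, 1.0].
                 Bins are [lo, hi), except the last one, which includes hi.
    """
    norms = np.linalg.norm(activations, axis=0, keepdims=True)
    unit = activations / np.maximum(norms, 1e-12)  # guard against zero columns
    sim = np.clip(np.abs(unit.T @ unit), 0.0, 1.0)  # pairwise |cos| in [0, 1]
    n = sim.shape[0]
    off_diag = ~np.eye(n, dtype=bool)  # ignore each neuron's self-similarity
    percentages = []
    last = len(edges) - 2
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        below_hi = sim <= hi if i == last else sim < hi
        in_range = (sim >= lo) & below_hi & off_diag
        percentages.append(100.0 * float(in_range.any(axis=1).mean()))
    return percentages

# Toy layer: neurons 0 and 1 are parallel, neuron 2 is orthogonal to both.
acts = np.array([[1.0, 2.0, 0.0],
                 [0.0, 0.0, 1.0]])
result = percent_with_similarity_in_range(acts, edges=[0.0, 0.5, 1.0])
```

Note that a neuron can be counted in more than one range, so the percentages need not sum to 100. In the toy layer every neuron has a partner in the low range, while only the two parallel neurons appear in the high range.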

Theorems & Definitions (1)

  • Definition 3.1: Redundancy Criteria