MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi

TL;DR

This work tackles the memory bottleneck of gradient-based pruning for large language models by introducing MINI-LLM, a memory-efficient structured pruning framework. It combines a novel Feature Map Sensitivity (FMS) score, which fuses magnitude, activation, and gradient information, with a zeroth-order gradient estimation method (based on SPSA) so that pruning is guided using only forward passes. After pruning, performance is recovered via LoRA-based fine-tuning, which keeps the memory overhead of adaptation low. Across LLaMA, BLOOM, and OPT, MINI-LLM consistently outperforms gradient-free baselines and rivals backpropagation-based pruning, while keeping GPU memory usage close to that of gradient-free methods. The results suggest that gradients estimated from forward passes carry enough information to guide pruning at scale, making gradient-informed compression of large LLMs practical under tight memory budgets.

Abstract

As Large Language Models (LLMs) grow dramatically in size, there is growing interest in compressing and accelerating them. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compression, especially when pruning medium-size networks. However, the substantial memory required to compute gradients with backpropagation impedes their use in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion that integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome the memory barrier, we estimate gradients using only forward passes. Based on this, we propose a Memory-effIcieNt structured prunIng procedure for LLMs (MINI-LLM) to remove non-critical channels and attention heads. Experimental results demonstrate the superior performance of MINI-LLM over existing gradient-free methods on three LLMs (LLaMA, BLOOM, and OPT) across various downstream tasks (classification, multiple-choice, and generation), while MINI-LLM maintains a GPU memory footprint akin to that of gradient-free methods.

Paper Structure (12 sections, 10 equations, 6 figures, 12 tables, 1 algorithm)

This paper contains 12 sections, 10 equations, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Peak GPU memory usage when pruning LLaMA-7B. The backpropagation-based pruning method, LLM-Pruner, consumes about twice as much GPU memory as the gradient-free methods and our method, MINI-LLM.
  • Figure 2: Similarity in pruned channels at the prune ratio of 30%. LLM-Pruner and MINI-LLM (ours) have more similar pruned channels compared to LLM-Pruner and Wanda.
  • Figure 3: Zero-shot perplexity of the pruned LLaMA-13B models when fine-tuning is applied. MINI-LLM consistently maintains its substantial advantage over the magnitude-based method across a spectrum of pruning ratios.
  • Figure 4: The outcomes of gradient-based vs. gradient-free criteria for pruning LLaMA-7B. The results demonstrate that our score function FMS consistently yields better performance compared to Wanda’s pruning criterion.
  • Figure 5: The zero-shot perplexity of the pruned models achieved by enabling different ranges of layers involved in pruning LLaMA-7B on the PTB dataset. Layer-1-30 has the worst performance, while layer-3-30 and layer-4-29 have comparably better performance for both pruning methods. In contrast, the models derived from layer-5-28 pruning exhibit varying responses to different pruning methods.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Definition 1: ZO Gradient Estimation.
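
The zeroth-order (ZO) gradient estimation named in Definition 1 can be illustrated with a generic SPSA-style estimator. The sketch below is not the paper's implementation: the function name `spsa_gradient`, the Gaussian perturbation, the step size `eps`, and the averaging over `n_samples` are all assumptions chosen for illustration. It only shows the core idea from the abstract, that a gradient estimate can be built from two loss evaluations per random direction, with no backpropagation.

```python
import numpy as np

def spsa_gradient(loss_fn, theta, eps=1e-3, n_samples=1, seed=0):
    """Estimate the gradient of loss_fn at theta from forward evaluations only.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta, dtype=float)
    for _ in range(n_samples):
        # Random direction that perturbs all parameters simultaneously
        # (Gaussian here, which is an assumption of this sketch).
        z = rng.standard_normal(theta.shape)
        # Two forward passes: loss at theta + eps*z and at theta - eps*z.
        loss_plus = loss_fn(theta + eps * z)
        loss_minus = loss_fn(theta - eps * z)
        # Central finite difference along z, projected back onto z.
        grad += (loss_plus - loss_minus) / (2.0 * eps) * z
    return grad / n_samples

# Usage on a toy quadratic whose true gradient equals theta.
theta = np.array([1.0, -2.0, 0.5])
estimate = spsa_gradient(lambda t: 0.5 * float(np.sum(t ** 2)), theta,
                         n_samples=20000)
print(estimate)  # close to theta, up to sampling noise
```

A single sample gives a noisy estimate; averaging more directions reduces variance at the cost of extra forward passes. Only the perturbed parameters and two scalar losses are needed at any time, which is why this family of estimators avoids the activation and gradient storage that backpropagation requires.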