PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs

Max Zimmer, Megi Andoni, Christoph Spiegel, Sebastian Pokutta

TL;DR

The paper tackles the high cost of retraining after pruning large neural networks by showing that updating a tiny, highly expressive subset of parameters (e.g., biases or LayerNorm parameters) can recover or even exceed the performance of full retraining across various sparsity levels. It introduces two LoRA-inspired variants, multlora and masklora, that preserve sparsity while allowing the adapters to be merged back into the original weights, and it shows that these methods enable memory-efficient layer-wise reconstruction that improves retraining-free pruning methods. Empirical results across OPT, LLaMA-2, Mistral, and Mixtral models (up to 30B parameters) show that retraining as little as 0.01–0.05% of parameters can match full retraining, with substantial gains in throughput and memory efficiency. The work offers a practical alternative to full retraining, making prune-retrain workflows feasible for very large language models and motivating further work on parameter-efficient sparse fine-tuning.
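To make the scale of "a tiny subset of parameters" concrete, the following is a minimal sketch of magnitude pruning on a single linear layer, followed by a count of how small the trainable fraction is when only the bias is retrained. It is an illustration under stated assumptions, not the authors' code: the helper `magnitude_prune`, the toy layer shape, and the 50% sparsity level are all hypothetical choices.

```python
import numpy as np

def magnitude_prune(weight, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest magnitude.

    Returns the pruned weight and a boolean mask (True = kept weight).
    Hypothetical helper for illustration only.
    """
    k = int(round(sparsity * weight.size))  # number of entries to remove
    mask = np.ones(weight.size, dtype=bool)
    if k > 0:
        # Indices of the k smallest-magnitude entries (order among them is irrelevant).
        prune_idx = np.argpartition(np.abs(weight).ravel(), k - 1)[:k]
        mask[prune_idx] = False
    mask = mask.reshape(weight.shape)
    return weight * mask, mask

# Toy linear layer: 64 outputs, 256 inputs, one bias per output (assumed sizes).
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))
b = np.zeros(64)

W_sparse, mask = magnitude_prune(W, sparsity=0.5)

# If only the bias is retrained, the trainable share of this layer is
# b.size / (W.size + b.size) = 1 / (d_in + 1).
trainable_fraction = b.size / (W.size + b.size)
print(f"kept weights: {int(mask.sum())} of {W.size}")
print(f"trainable fraction (bias only): {trainable_fraction:.4%}")
```

For a layer with input width d_in, the bias accounts for 1 / (d_in + 1) of the layer's parameters, so the share shrinks as layers get wider. The exact percentages reported in the paper depend on the model architecture and on which parameter groups are retrained.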

Abstract

Neural Networks can be effectively compressed through pruning, significantly reducing storage and compute demands while maintaining predictive performance. Simple yet effective methods like magnitude pruning remove less important parameters and typically require a costly retraining procedure to restore performance. However, with the rise of LLMs, full retraining has become infeasible due to memory and compute constraints. This study challenges the practice of retraining all parameters by showing that updating a small subset of highly expressive parameters can suffice to recover or even enhance performance after pruning. Surprisingly, retraining just 0.01%-0.05% of the parameters in GPT-architectures can match the performance of full retraining across various sparsity levels, significantly reducing compute and memory requirements, and enabling retraining of models with up to 30 billion parameters on a single GPU in minutes. To bridge the gap to full retraining in the high sparsity regime, we introduce two novel LoRA variants that, unlike standard LoRA, allow merging adapters back without compromising sparsity. Going a step further, we show that these methods can be applied for memory-efficient layer-wise reconstruction, significantly enhancing state-of-the-art retraining-free methods like Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023). Our findings present a promising alternative to avoiding retraining.
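The remark about merging adapters "without compromising sparsity" can be illustrated with a small sketch. A standard LoRA update adds a dense low-rank product to the weight, which fills in the pruned zeros when merged. One way to avoid this, assumed here for illustration and not necessarily the paper's exact formulation of masklora, is to apply the pruning mask to the low-rank update before merging. All names, shapes, and the random mask below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 16, 2  # toy sizes, chosen for illustration

W = rng.normal(size=(d_out, d_in))
mask = rng.random((d_out, d_in)) > 0.5  # stand-in pruning mask (True = kept)
W_sparse = W * mask

# Low-rank adapter factors, as in LoRA: the update is B @ A.
B = rng.normal(size=(d_out, rank))
A = rng.normal(size=(rank, d_in))

def merge_standard(w_sparse, b_factor, a_factor):
    """Merge a plain LoRA update; the dense product overwrites pruned zeros."""
    return w_sparse + b_factor @ a_factor

def merge_masked(w_sparse, keep_mask, b_factor, a_factor):
    """Merge a masked update; pruned positions receive no update and stay zero."""
    return w_sparse + keep_mask * (b_factor @ a_factor)

def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(matrix == 0))

merged_standard = merge_standard(W_sparse, B, A)
merged_masked = merge_masked(W_sparse, mask, B, A)

print("sparsity before merge:      ", sparsity(W_sparse))
print("sparsity after plain merge: ", sparsity(merged_standard))
print("sparsity after masked merge:", sparsity(merged_masked))
```

In this sketch the masked merge leaves every pruned position at zero, while the plain merge produces a dense matrix. The trade-off is that the masked update can only adjust the weights that survived pruning.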

Paper Structure (29 sections, 1 equation, 7 figures, 24 tables)

Figures (7)

  • Figure 1: OPT-2.7B evaluated on WikiText: Final perplexity vs. sparsity after pruning, followed by retraining only the specified parameter subset. We indicate the percentage of trainable parameters in parentheses. Full ft refers to full retraining of all parameters.
  • Figure 2: OPT-6.7B evaluated on WikiText: Final perplexity after retraining using masklora for as many iterations as indicated on the x-axis. masklora retrains roughly 1% of the parameters.
  • Figure 3: OPT-2.7B evaluated on WikiText: Final perplexity vs. sparsity after pruning, followed by retraining only the specified parameter subset. We indicate the percentage of trainable parameters in parentheses. Full ft refers to full retraining of all parameters.
  • Figure 4: OPT-2.7B evaluated on the EleutherAI tasks: Final average zero-shot accuracy vs. sparsity after pruning, followed by retraining only the specified parameter subset. We indicate the percentage of trainable parameters in parentheses. Full ft refers to full retraining of all parameters.
  • Figure 5: Features produced by a single filter from the first convolutional layer of AlexNet (Krizhevsky et al., 2012). From left to right: original image, output from a pretrained model, and output from the magnitude-pruned version of the same model.
  • ...and 2 more figures