Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang

TL;DR

Compresso tackles the challenge of efficiently pruning large language models by making training-based structured pruning memory-efficient and by adding collaborative prompting. It combines LoRA-enhanced, $L_0$-regularized pruning masks with a pruning pipeline driven by instruction-tuning data, and introduces a dedicated collaborative prompt that lets the LLM participate in pruning decisions. On LLaMA-7B, Compresso reduces the model to 5.4B parameters while preserving performance on zero-shot commonsense reasoning, reading comprehension, and few-shot MMLU/BBH, even surpassing the original model on some tasks. Across sparsity settings, it consistently outperforms one-shot baselines, demonstrating the practical viability of training-based LLM pruning with collaborative prompting for deployment.

Abstract

Despite the remarkable success of Large Language Models (LLMs), their massive size poses significant deployment challenges, particularly on resource-constrained hardware. While existing LLM compression methods focus on quantization, pruning remains relatively unexplored because training-based approaches are costly and data collection is difficult. One-shot pruning methods, which are cost-effective and data-free, have become dominant in LLM pruning, but they degrade performance in the structured pruning setting. In this work, we introduce Compresso, a new paradigm for structurally pruning LLMs. Our approach learns optimal pruning decisions during training through the collaboration of the proposed resource-efficient pruning algorithm and the LLM itself. Compresso addresses the challenges of expensive training costs and data collection by incorporating Low-Rank Adaptation (LoRA) into the $L_0$ regularization during the instruction tuning process. We further augment the pruning algorithm with a collaborative prompt that fosters collaboration between the LLM and the pruning algorithm, significantly boosting overall performance. With this approach, Compresso prunes LLaMA-7B to 5.4B parameters, maintaining the original performance and even surpassing LLaMA-7B in reading comprehension by 2.62%. Extensive experiments demonstrate that Compresso significantly outperforms one-shot pruning baselines across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81% higher scores on the commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks, respectively.
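
As a rough illustration of the mechanism the abstract describes, the sketch below pairs a frozen weight matrix with a low-rank LoRA update and multiplies each output unit by an $L_0$-style gate. The abstract does not state how the gates are parameterized, so the hard-concrete relaxation used here, its constants, and all function names (`sample_gate`, `expected_l0`, `deterministic_gate`, `masked_lora_linear`) are assumptions for illustration, not Compresso's actual implementation.

```python
import math
import random

# Stretch limits and temperature of the hard-concrete gate.
# ASSUMPTION: common defaults for L0 regularization, not values from the paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, rng):
    """Draw one stochastic gate value in [0, 1] (used during training)."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def expected_l0(log_alphas):
    """Expected number of non-zero gates, usable as a sparsity penalty."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(a - shift) for a in log_alphas)


def deterministic_gate(log_alpha):
    """Gate value without sampling noise (used after training)."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def masked_lora_linear(x, W, A, B, gates):
    """Compute gate_i * ((W + B A) x)_i for every output unit i.

    W is the frozen pretrained weight; A (r x d_in), B (d_out x r),
    and the gates are the only trainable pieces in this sketch.
    """
    Ax = [sum(a_kj * x_j for a_kj, x_j in zip(row, x)) for row in A]
    out = []
    for w_row, b_row, g in zip(W, B, gates):
        base = sum(w * x_j for w, x_j in zip(w_row, x))
        delta = sum(b * ax for b, ax in zip(b_row, Ax))
        out.append(g * (base + delta))
    return out
```

Output units whose gate settles at exactly zero can be removed from the layer outright, which is what makes the resulting sparsity structured. Training only `A`, `B`, and the gate parameters while `W` stays frozen is the source of the memory savings in this kind of setup.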

Paper Structure (13 sections, 5 equations, 3 figures, 8 tables)


Figures (3)

  • Figure 1: The overall framework of Compresso. We propose a collaborative pruning framework, where a memory-efficient pruning algorithm and target LLM work together through a collaborative prompt to learn optimal pruning decisions.
  • Figure 2: An example to illustrate the use of our prompt in the proposed collaborative pruning.
  • Figure 3: The remaining ratios of heads (upper) and FFN intermediate size (lower) among various layers when targeting a size of 4.5B.
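
To give a sense of how per-layer remaining ratios such as those plotted in Figure 3 translate into a model size, here is a back-of-the-envelope parameter count. It assumes the public LLaMA-7B shape (32 layers, hidden size 4096, 32 heads of dimension 128, FFN intermediate size 11008, vocabulary 32000, untied input and output embeddings) and that only attention heads and FFN intermediate units are removed. Any other pruned dimension is ignored, and `count_params` is a hypothetical helper rather than anything from the paper.

```python
# Public LLaMA-7B dimensions (assumed; not stated in this excerpt).
LAYERS, HIDDEN, HEADS, HEAD_DIM = 32, 4096, 32, 128
FFN_DIM, VOCAB = 11008, 32000


def count_params(head_ratios, ffn_ratios):
    """Parameter count given per-layer remaining ratios in [0, 1].

    head_ratios[i] is the fraction of attention heads kept in layer i,
    ffn_ratios[i] the fraction of FFN intermediate units kept in layer i.
    """
    assert len(head_ratios) == LAYERS and len(ffn_ratios) == LAYERS
    total = 2 * VOCAB * HIDDEN + HIDDEN  # embeddings, LM head, final norm
    for h, f in zip(head_ratios, ffn_ratios):
        kept_heads = round(h * HEADS)
        kept_ffn = round(f * FFN_DIM)
        attn = 4 * HIDDEN * kept_heads * HEAD_DIM  # q, k, v, o projections
        ffn = 3 * HIDDEN * kept_ffn                # gate, up, down projections
        total += attn + ffn + 2 * HIDDEN           # plus two norm vectors
    return total


if __name__ == "__main__":
    full = count_params([1.0] * LAYERS, [1.0] * LAYERS)
    print(f"unpruned: {full / 1e9:.2f}B parameters")
    # Uniform ratios below are placeholders, not the values from Figure 3.
    pruned = count_params([0.75] * LAYERS, [0.75] * LAYERS)
    print(f"75% heads and FFN kept: {pruned / 1e9:.2f}B parameters")
```

Because the FFN projections hold roughly twice as many parameters per layer as the attention projections in this shape, the same remaining ratio removes more parameters when applied to the FFN than to the heads.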