Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining

Jianwei Li, Yijun Dong, Qi Lei

TL;DR

This work tackles the challenge of pruning large language models without retraining by identifying depth-2 pruning structures within Transformer blocks and proposing inference-aware criteria grounded in output approximation. It combines a depth-2 module pruning strategy with two novel metrics (a similarity-based attention-head metric and a second-moment-based depth-2 metric) and introduces a pre-pruning recovery step to mitigate pruning errors without backpropagation. Empirical results across multiple datasets and model scales show substantial reductions in computation and hardware requirements while preserving or improving performance relative to several data-free, data-dependent, and retraining-based baselines. The approach emphasizes structured pruning aligned with hardware considerations and provides practical mechanisms for efficient LLM compression without a retraining phase.

Abstract

To remove redundant components of large language models (LLMs) without incurring significant computational costs, this work focuses on single-shot pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperform traditional training-aware metrics such as gradient and Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our approach significantly reduces computational costs and hardware requirements while maintaining superior performance across various datasets and models.
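
The abstract does not spell out the pruning criteria, so the following is only a rough sketch of what structured pruning of a depth-2 module under an output-approximation view could look like. It assumes a module of the form y = W2 · relu(W1 · x) and scores each hidden neuron by the second moment of its activation times the squared norm of its outgoing column. The function names (`neuron_scores`, `prune`), the ReLU nonlinearity, the toy sizes, and the score itself are illustrative assumptions, not the paper's definitions.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hidden(W1, x):
    """First layer of the depth-2 module, followed by ReLU (assumed)."""
    return [max(0.0, z) for z in matvec(W1, x)]


def forward(W1, W2, x):
    """Depth-2 module: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, hidden(W1, x))


def neuron_scores(W1, W2, calib):
    """Hypothetical score: mean(h_i^2) * ||W2[:, i]||^2 over calibration inputs."""
    n = len(W1)
    second_moment = [0.0] * n
    for x in calib:
        for i, h_i in enumerate(hidden(W1, x)):
            second_moment[i] += h_i * h_i / len(calib)
    col_norm_sq = [sum(row[i] ** 2 for row in W2) for i in range(n)]
    return [m * c for m, c in zip(second_moment, col_norm_sq)]


def prune(W1, W2, keep):
    """Structured removal: drop rows of W1 and the matching columns of W2."""
    return [W1[i] for i in keep], [[row[i] for i in keep] for row in W2]


# Toy module with 3 inputs, 4 hidden neurons, 2 outputs (sizes are arbitrary).
random.seed(0)
W1 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
W2 = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]
for row in W2:
    row[2] = 0.0  # make hidden neuron 2 redundant by construction
calib = [[random.gauss(0, 1) for _ in range(3)] for _ in range(16)]

scores = neuron_scores(W1, W2, calib)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
keep = sorted(ranked[:3])  # keep the three highest-scoring neurons
W1_p, W2_p = prune(W1, W2, keep)
```

Under these assumptions, a neuron's score equals the mean squared change in the module output on the calibration inputs when that neuron alone is removed, so dropping the lowest-scoring neurons greedily targets the smallest output error. Because whole rows and columns are removed, the pruned weights stay dense and smaller, which is what makes the pruning structured.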

Paper Structure (25 sections, 3 equations, 6 figures, 5 tables, 4 algorithms)

Figures (6)

  • Figure 1: Pruning metric analysis from the optimization perspective. A: Function Approximation; B: Output Approximation; C: Objective Approximation.
  • Figure 2: Pruning structure recognition. A: Two pruning strategies for the depth-2 module. B: Depth-2 modules identification in Transformer-based LLMs.
  • Figure 3: Similarity visualization of attention heads in A: block 4 and B: block 5 for LLaMA-7B. Heads with divergence less than $\tau = 0.20$ are connected.
  • Figure 4: Performance of compressed A: LLaMA-7B (w/o Remediation) and B: GPT-2 (w/ Remediation) as a function of the number of calibration samples.
  • Figure 5: Mean activation value of LLaMA-7B and GPT-2 on Wikitext2.
  • ...and 1 more figure