ACE: Exploring Activation Cosine Similarity and Variance for Accurate and Calibration-Efficient LLM Pruning

Zhendong Mi, Zhenglun Kong, Geng Yuan, Shaoyi Huang

TL;DR

This paper tackles the challenge of pruning large language models quickly and accurately while requiring less calibration data. It introduces two metrics: CosP, which minimizes the angular deviation of output activations between the dense and pruned models, and VarP, which uses input-activation variance to preserve token-level semantic distinctions. A combined metric, S_{(cos+var)_{ij}}, unifies the two to guide pruning decisions. The authors provide a calibration-data-efficiency analysis and show through experiments on LLaMA, LLaMA-2, and OPT that CosP, VarP, and especially CosP+VarP outperform strong baselines (Wanda, RIA) in perplexity, zero-shot accuracy, and pruning speed, including with reduced calibration data and under N:M sparsity settings. The work offers practical pruning strategies for deploying large models in resource-constrained environments and points to future extensions to quantization and broader architectures.

Abstract

With the rapid expansion of large language models (LLMs), the demand for memory and computational resources has grown significantly. Recent advances in LLM pruning aim to reduce the size and computational cost of these models. However, existing methods often suffer from either suboptimal pruning performance or low time efficiency during the pruning process. In this work, we propose an efficient and effective pruning method that simultaneously achieves high pruning performance and fast pruning speed with improved calibration efficiency. Our approach introduces two key innovations: (1) An activation cosine similarity loss-guided pruning metric, which considers the angular deviation of the output activation between the dense and pruned models. (2) An activation variance-guided pruning metric, which helps preserve semantic distinctions in output activations after pruning, enabling effective pruning with shorter input sequences. These two components can be readily combined to enhance LLM pruning in both accuracy and efficiency. Experimental results show that our method achieves up to an 18% reduction in perplexity and up to 63% decrease in pruning time on prevalent LLMs such as LLaMA, LLaMA-2, and OPT.
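
To make the two quantities named in the abstract concrete, the sketch below measures (a) the cosine similarity between the output activations of a dense and a pruned linear layer and (b) the per-channel variance of the input activations. It is only an illustration under stated assumptions: the pruning step uses plain per-row magnitude pruning as a stand-in, not the paper's CosP or VarP scores (their formulas are not given in this summary), and the function names, shapes, and random data are hypothetical.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero the smallest-|w| entries of each output row.

    Stand-in criterion for illustration only; this is NOT the paper's metric.
    """
    W = W.copy()
    k = int(W.shape[1] * sparsity)  # weights to drop per row
    if k == 0:
        return W
    idx = np.argsort(np.abs(W), axis=1)[:, :k]  # k smallest magnitudes per row
    np.put_along_axis(W, idx, 0.0, axis=1)
    return W

def mean_output_cosine(W_dense, W_pruned, X):
    """Mean per-token cosine similarity between dense and pruned outputs."""
    Y_d = X @ W_dense.T
    Y_p = X @ W_pruned.T
    num = np.sum(Y_d * Y_p, axis=1)
    den = np.linalg.norm(Y_d, axis=1) * np.linalg.norm(Y_p, axis=1) + 1e-12
    return float(np.mean(num / den))

def input_activation_variance(X):
    """Variance of each input channel across calibration tokens."""
    return X.var(axis=0)

# Hypothetical calibration batch: 128 tokens, 64 input features, 32 outputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))
W = rng.standard_normal((32, 64))

W_half = magnitude_prune(W, 0.5)        # 50% unstructured sparsity per row
cos = mean_output_cosine(W, W_half, X)  # 1.0 would mean no angular deviation
var = input_activation_variance(X)      # one value per input channel
print(f"mean output cosine: {cos:.4f}, variance vector shape: {var.shape}")
```

A value of `cos` close to 1 means the pruned layer's outputs point in nearly the same direction as the dense layer's, which is the kind of angular agreement a cosine-similarity-guided metric aims to preserve.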

Paper Structure

This paper contains 24 sections, 44 equations, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Example of angular deviation before and after pruning
  • Figure 2: The motivating example of our proposed activation variance-guided pruning metric