Efficient Post-Training Pruning of Large Language Models with Statistical Correction

Peiqi Yu, Jinhao Wang, Xinyi Sui, Nam Ling, Wei Wang, Wei Jiang

TL;DR

This work tackles the challenge of pruning large language models after training without incurring heavy computational costs. It introduces two key components: variance-calibrated weight selection (CVR), which modulates magnitude-based importance scores with channel variance to mitigate activation-outlier bias, and energy compensation (EC), a closed-form, gradient-free correction that realigns layer-wise activation energy after pruning. The approach relies solely on first-order statistics and requires neither second-order information nor retraining, so its pruning cost is comparable to that of heuristic methods while fidelity and robustness improve. Across multiple LLM families and sparsity patterns, CVR+EC yields consistent improvements in language modeling perplexity and zero-shot task performance, with favorable runtime and stability properties. These results suggest that simple statistical corrections can substantially enhance post-training pruning outcomes.

Abstract

Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.
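
The two steps described above can be illustrated with a minimal NumPy sketch. The exact scoring and compensation formulas are not given in this summary, so everything below is an assumption made for illustration: the function names (`calibrated_scores`, `prune_rows`, `energy_compensate`), the particular variance-damping term, and the per-row rescaling that matches expected output energy under per-channel activation second moments. It is not the authors' implementation.

```python
import numpy as np

def calibrated_scores(W, X, eps=1e-8):
    """Hypothetical variance-calibrated importance scores.

    W: (out_features, in_features) weights; X: (n_samples, in_features) activations.
    Starts from a magnitude-times-activation-norm score and damps channels whose
    spread is large relative to the average channel (an assumed calibration).
    """
    act_norm = np.sqrt((X ** 2).sum(axis=0))       # per-channel L2 norm
    act_std = X.std(axis=0)                        # per-channel standard deviation
    damping = 1.0 + act_std / (act_std.mean() + eps)
    return np.abs(W) * (act_norm / damping)[None, :]

def prune_rows(W, scores, sparsity):
    """Zero the lowest-scoring fraction `sparsity` of weights in each output row."""
    k = int(W.shape[1] * sparsity)                 # weights removed per row
    drop = np.argsort(scores, axis=1)[:, :k]       # indices of the k smallest scores
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return W * mask, mask

def energy_compensate(W_dense, W_pruned, X, eps=1e-8):
    """Assumed closed-form correction: rescale each row so that its expected
    output energy (treating input channels as uncorrelated) matches the dense row.
    Scaling by a positive factor leaves the sparsity pattern unchanged."""
    second_moment = (X ** 2).mean(axis=0)
    e_dense = (W_dense ** 2) @ second_moment
    e_pruned = (W_pruned ** 2) @ second_moment
    scale = np.sqrt(e_dense / (e_pruned + eps))
    return W_pruned * scale[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8))
    X = rng.normal(size=(32, 8))
    W_pruned, mask = prune_rows(W, calibrated_scores(W, X), sparsity=0.5)
    W_corrected = energy_compensate(W, W_pruned, X)
    print(mask.sum(axis=1))                        # weights kept per row
```

Both steps use only per-channel statistics from a calibration batch and a single pass over the weights, which is the property the abstract emphasizes: no gradients, retraining, or second-order information.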

Paper Structure

This paper contains 31 sections, 11 equations, 2 figures, 5 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Overview of the proposed statistical correction framework. First-order statistics from pretrained weights $W$ and activations $X$ are used for variance-calibrated importance scoring and post-pruning energy compensation. The two components are applied independently and produce corrected weights while preserving the sparsity pattern.
  • Figure 2: Perplexity across calibration set sizes on LLaMA-2-13B with 50% unstructured sparsity. Lower is better.