Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization
Sungbin Shin, Wonpyo Park, Jaeho Lee, Namhoon Lee
TL;DR
Pruning large language models under memory constraints typically proceeds piece by piece, with each sparse submodel trained to reconstruct the predictions of its dense counterpart; the errors from these separate reconstructions can compound across the model. The authors introduce block-wise reconstruction, global propagation, and cross-block reconstruction to reduce reconstruction error within this divide-and-conquer pruning framework, and they analyze how self-generated calibration data affects generalization. Their results show that these techniques can substantially reduce reconstruction error and improve perplexity and some downstream tasks, but that aggressive error minimization can overfit the calibration data, particularly for larger models; self-generated calibration data can mitigate this risk. Overall, the work provides a practical pathway for memory-efficient LLM pruning and highlights a trade-off between reconstruction and generalization that informs future robustness-focused pruning strategies. The reconstruction objective can be framed as $\min_{w,m} \| f(\bar{w}; \mathcal{D}) - f(m \odot w; \mathcal{D}) \|_2^2$ subject to $\|m\|_0 \le k$, where $\bar{w}$ denotes the dense weights, $m$ a binary pruning mask, $\mathcal{D}$ the calibration data, and $k$ the maximum number of weights kept; this links the pruning mask to preserved predictive behavior on the calibration data.
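As a rough illustration of the objective above, the following minimal sketch applies it to a single linear layer, $f(w; X) = Xw$. It is not the authors' implementation: the mask is chosen by weight magnitude (a swapped-in, simple criterion), the kept weights are refit by ordinary least squares, and the helper names `magnitude_mask`, `reconstruction_error`, and `reconstruct` are hypothetical.

```python
import numpy as np

def magnitude_mask(w, sparsity):
    """Boolean mask keeping the largest-magnitude entries of w (assumed criterion)."""
    k = int(round(w.size * (1.0 - sparsity)))  # number of weights to keep
    mask = np.zeros(w.shape, dtype=bool)
    if k > 0:
        keep = np.argsort(np.abs(w), axis=None)[-k:]  # flat indices of the k largest
        mask.flat[keep] = True
    return mask

def reconstruction_error(x, w_dense, w_sparse):
    """Squared L2 distance between dense and sparse layer outputs on x."""
    return float(np.sum((x @ w_dense - x @ w_sparse) ** 2))

def reconstruct(x, w_dense, mask):
    """Refit the kept weights, one output column at a time, by least squares."""
    target = x @ w_dense
    w_new = np.zeros_like(w_dense)
    for j in range(w_dense.shape[1]):
        rows = np.flatnonzero(mask[:, j])  # input features kept for output j
        if rows.size:
            sol, *_ = np.linalg.lstsq(x[:, rows], target[:, j], rcond=None)
            w_new[rows, j] = sol
    return w_new

# Toy calibration data and dense weights (random, for illustration only).
rng = np.random.default_rng(0)
x_calib = rng.standard_normal((64, 16))
w_dense = rng.standard_normal((16, 8))

mask = magnitude_mask(w_dense, sparsity=0.5)
w_masked = w_dense * mask                      # prune without reconstruction
w_refit = reconstruct(x_calib, w_dense, mask)  # prune, then reconstruct

err_masked = reconstruction_error(x_calib, w_dense, w_masked)
err_refit = reconstruction_error(x_calib, w_dense, w_refit)
print(f"masked only: {err_masked:.3f}  after refit: {err_refit:.3f}")
```

Because the masked weights are themselves a feasible point of the least-squares problem, the refit error on the calibration data cannot exceed the masked-only error. The sketch says nothing about error on unseen data, which is where the overfitting concern arises.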
Abstract
This work suggests fundamentally rethinking the current practice of pruning large language models (LLMs). The current practice follows a divide-and-conquer scheme: split the model into submodels, prune them sequentially, and reconstruct the predictions of their dense counterparts on small calibration data one at a time; the final model is obtained simply by putting the resulting sparse submodels together. While this approach enables pruning under memory constraints, it generates high reconstruction errors. In this work, we first present an array of reconstruction techniques that can reduce this error by more than $90\%$. Unexpectedly, however, we discover that minimizing reconstruction error is not always ideal and can overfit the given calibration data, resulting in increased language perplexity and poor performance on downstream tasks. We find that a strategy of self-generating calibration data can mitigate this trade-off between reconstruction and generalization, suggesting new directions in light of both the benefits and pitfalls of reconstruction for pruning LLMs.
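The divide-and-conquer procedure described above can be sketched on a toy model to show where the two kinds of error are measured. This is a simplified illustration under stated assumptions, not the paper's pipeline: the "submodels" are hypothetical linear-plus-ReLU blocks with random weights, pruning is by magnitude only, and the reconstruction step is omitted for brevity. Each block is pruned against dense inputs, the sparse blocks are then assembled, and the error is measured both locally and end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights (assumed pruning criterion)."""
    k = int(round(w.size * sparsity))  # number of weights to remove
    w_sparse = w.copy()
    if k > 0:
        drop = np.argsort(np.abs(w), axis=None)[:k]  # flat indices of the k smallest
        w_sparse.flat[drop] = 0.0
    return w_sparse

def block(x, w):
    """Hypothetical submodel: a linear map followed by ReLU."""
    return np.maximum(x @ w, 0.0)

# Toy dense model and calibration data (random, for illustration only).
dense_blocks = [rng.standard_normal((16, 16)) / 4.0 for _ in range(4)]
calib = rng.standard_normal((32, 16))

# Divide and conquer: prune each block separately, feeding it dense activations.
sparse_blocks, local_errors = [], []
x_dense = calib
for w in dense_blocks:
    w_sparse = prune_by_magnitude(w)
    diff = block(x_dense, w) - block(x_dense, w_sparse)
    local_errors.append(float(np.sum(diff ** 2)))
    sparse_blocks.append(w_sparse)
    x_dense = block(x_dense, w)  # the next block sees dense activations

# Put the sparse blocks together and compare with the dense model at each depth.
x_d, x_s = calib, calib
end_to_end_errors = []
for w, w_s in zip(dense_blocks, sparse_blocks):
    x_d, x_s = block(x_d, w), block(x_s, w_s)
    end_to_end_errors.append(float(np.sum((x_d - x_s) ** 2)))

for i, (loc, e2e) in enumerate(zip(local_errors, end_to_end_errors)):
    print(f"block {i}: local error {loc:.3f}, end-to-end error {e2e:.3f}")
```

The local and end-to-end errors coincide for the first block, since both use the same inputs. From the second block on, the assembled sparse model receives activations that already deviate from the dense ones, which the per-block measurement does not account for.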
