Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization

Sungbin Shin, Wonpyo Park, Jaeho Lee, Namhoon Lee

TL;DR

Pruning large language models under memory constraints often relies on reconstructing dense predictions one submodel at a time, which can incur compounding errors. The authors introduce block-wise reconstruction, global propagation, and cross-block reconstruction to minimize reconstruction error within a divide-and-conquer pruning framework, and they analyze how self-generated calibration data affects generalization. Their results show that these reconstruction techniques can substantially reduce error and improve perplexity and performance on some downstream tasks, but that aggressive error minimization can overfit the calibration data, particularly for larger models; self-generated calibration data can mitigate this risk. Overall, the work provides a practical pathway for memory-efficient LLM pruning and highlights a trade-off between reconstruction and generalization that informs future robustness-focused pruning strategies. The reconstruction objective can be framed as $\min_{w,m} \| f(\bar{w}; \mathcal{D}) - f(m \odot w; \mathcal{D}) \|_2^2$ subject to $\|m\|_0 \le k$, where $\bar{w}$ denotes the dense weights, $m$ a binary pruning mask, $k$ the number of weights kept, and $\mathcal{D}$ the calibration data; this links the pruning mask to how well predictive behavior is preserved on $\mathcal{D}$.
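To make the objective concrete, the following is a minimal sketch that only evaluates the reconstruction error for one fixed mask; it does not perform the minimization over $w$ and $m$. The toy model (a single dot product), the magnitude-based mask, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import random

def magnitude_mask(w, k):
    """Binary mask m with ||m||_0 = k, keeping the k largest-magnitude weights.
    Magnitude pruning is an illustrative choice, not the paper's criterion."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(w))]

def f(w, x):
    """Toy model f(w; x): a single dot product (hypothetical stand-in for an LLM)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def reconstruction_error(w_dense, w, m, data):
    """|| f(w_dense; D) - f(m * w; D) ||_2^2 summed over the calibration inputs."""
    masked = [mi * wi for mi, wi in zip(m, w)]
    return sum((f(w_dense, x) - f(masked, x)) ** 2 for x in data)

rng = random.Random(0)
dim, k = 8, 4
w_dense = [rng.gauss(0, 1) for _ in range(dim)]
data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

m = magnitude_mask(w_dense, k)
err_pruned = reconstruction_error(w_dense, w_dense, m, data)       # half the weights removed
err_full = reconstruction_error(w_dense, w_dense, [1] * dim, data)  # nothing removed
print("error with k kept weights:", err_pruned)
print("error with all weights kept:", err_full)
```

Keeping every weight gives zero error by construction, while any mask that drops weights leaves a residual that reconstruction would then try to reduce by also adjusting the surviving weights.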

Abstract

This work suggests fundamentally rethinking the current practice of pruning large language models (LLMs). The way it is done is by divide and conquer: split the model into submodels, sequentially prune them, and reconstruct predictions of the dense counterparts on small calibration data one at a time; the final model is obtained simply by putting the resulting sparse submodels together. While this approach enables pruning under memory constraints, it generates high reconstruction errors. In this work, we first present an array of reconstruction techniques that can significantly reduce this error by more than $90\%$. Unwittingly, however, we discover that minimizing reconstruction error is not always ideal and can overfit the given calibration data, resulting in rather increased language perplexity and poor performance at downstream tasks. We find out that a strategy of self-generating calibration data can mitigate this trade-off between reconstruction and generalization, suggesting new directions in the presence of both benefits and pitfalls of reconstruction for pruning LLMs.
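The divide-and-conquer procedure described above can be sketched on a toy stack of linear blocks. This is a rough illustration under stated assumptions rather than the authors' implementation: the blocks are plain matrices, pruning is by per-row magnitude, reconstruction is masked gradient descent, and each block's target is the dense block applied to the activations arriving from the already-pruned blocks. The names `prune_pipeline`, `prune_row`, and `reconstruct_row` are hypothetical.

```python
import random

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def prune_row(row, keep):
    """Zero all but the `keep` largest-magnitude weights (illustrative criterion)."""
    idx = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)[:keep]
    mask = [1.0 if j in idx else 0.0 for j in range(len(row))]
    return [w * m for w, m in zip(row, mask)], mask

def reconstruct_row(row, mask, inputs, targets, steps=200):
    """Masked gradient descent on the mean squared error to the dense targets.
    The step size 1/trace(Hessian) guarantees the error never increases."""
    n = len(inputs)
    lr = 1.0 / max(2.0 * sum(xj * xj for x in inputs for xj in x) / n, 1e-12)
    row = list(row)
    for _ in range(steps):
        grad = [0.0] * len(row)
        for x, t in zip(inputs, targets):
            r = sum(w * xj for w, xj in zip(row, x)) - t
            for j, xj in enumerate(x):
                grad[j] += 2.0 * r * xj / n
        row = [w - lr * g * m for w, g, m in zip(row, grad, mask)]  # pruned weights stay zero
    return row

def mse(A, B):
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) / len(A)

def prune_pipeline(blocks, calib, keep, reconstruct):
    """Prune blocks one at a time; return the error to the dense model after each block."""
    dense_acts, sparse_acts, errors = calib, calib, []
    for W in blocks:
        targets = [matvec(W, x) for x in sparse_acts]  # dense block on incoming activations
        W_sparse = []
        for i, row in enumerate(W):
            pruned, mask = prune_row(row, keep)
            if reconstruct:
                pruned = reconstruct_row(pruned, mask, sparse_acts, [t[i] for t in targets])
            W_sparse.append(pruned)
        dense_acts = [matvec(W, x) for x in dense_acts]
        sparse_acts = [matvec(W_sparse, x) for x in sparse_acts]
        errors.append(mse(dense_acts, sparse_acts))
    return errors

rng = random.Random(0)
d, n_blocks = 6, 3
blocks = [[[rng.gauss(0, 1) / d ** 0.5 for _ in range(d)] for _ in range(d)]
          for _ in range(n_blocks)]
calib = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(32)]

errors_plain = prune_pipeline(blocks, calib, keep=3, reconstruct=False)
errors_recon = prune_pipeline(blocks, calib, keep=3, reconstruct=True)
print("per-block error without reconstruction:", errors_plain)
print("per-block error with reconstruction:   ", errors_recon)
```

Because every block is solved in isolation, only one block's weights and activations need to be held at a time, which is what makes the approach workable under memory constraints; the per-block error list is where compounding would show up.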

Paper Structure (21 sections, 2 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: (a) Reconstruction techniques significantly reduce the compounding errors and lead to a substantial reduction of error in the final block. Reconstruction o and x refer to the results with and without the proposed reconstruction techniques (br, gp, cr) respectively. (b) Minimizing reconstruction error may not always be ideal since models can overfit calibration data (we show this in \ref{sec:exp-generalization}). Using our self-generated calibration data in the reconstruction process mitigates this issue quite effectively by decreasing test error, perplexity, and error rates for downstream tasks.
  • Figure 2: An illustration of reconstruction techniques for pruning large language models. Here, we want the sparse model $f(m \odot w; \cdot)$ to reconstruct the prediction of the dense model on some calibration data $\mathcal{D}$. lr, br, gp, and cr denote layer-wise reconstruction, block-wise reconstruction, global propagation, and cross-block reconstruction, respectively. Solid and dashed arrows represent inputs coming from the sparse and dense models, respectively.
  • Figure 3: Results of reconstruction techniques for LLaMA-7B. They consistently reduce the compounding errors, achieving a significant decrease at the final block ($\sim 90\%$). We find this trend is consistent across different settings. See \ref{fig:recon-error-app} and \ref{fig:recon-error-app-opt} in \ref{sec:app-results} for more results.
  • Figure 4: Effects of self-generated calibration data on (a) reconstruction error for test data (raw-Wikitext2) and (b) perplexity for LLaMA-7B; they both improve with more self-generation. See \ref{fig:gendata-app} in \ref{sec:app-results} for more results.
  • Figure 5: Results of reconstruction techniques for LLaMA-7B. They consistently reduce the compounding errors, achieving a significant decrease at the final block ($87\% \sim 94\%$).
  • ...and 2 more figures