High-Fidelity Pruning for Large Language Models

Yijun Zhu, Jianxin Wang, Chengchao Shen

TL;DR

The information entropy of the model's output distribution is proposed as a criterion for Taylor pruning. It evaluates the importance scores of neurons efficiently, without requiring an additional teacher model, and gives a more holistic criterion that prunes the neurons with the least impact on the model's overall prediction.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common pruning approach uses a Taylor expansion of the loss function to estimate neuron importance. A key limitation of this approach is its reliance on the one-hot cross entropy loss: it assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the other potential predictions of the original model. An intuitive solution is to employ a self-distillation criterion for importance evaluation. However, this introduces significant computational overhead, because it requires a separate teacher model for supervision. To this end, we propose a simple but effective criterion, the information entropy of the model's output distribution, to efficiently evaluate the importance scores of neurons with Taylor pruning without requiring an additional teacher. Compared to the plain cross entropy criterion, it provides a more holistic criterion for Taylor pruning, so that the pruned neurons are those with the least impact on the model's overall prediction, thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero-shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are available at https://github.com/visresearch/HFPrune.
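To make the contrast in the abstract concrete, the two criteria and the first-order Taylor score can be written out as below. The notation is an assumption made for illustration ($p_\theta$ for the model's next-token distribution, $\mathcal{V}$ for the vocabulary, $y$ for the label token, $h_k$ for a hidden neuron, $\mathcal{F}$ for whichever criterion is used); the paper's own equations are not reproduced on this page and may differ in form.

```latex
% One-hot cross entropy: depends only on the probability of the label token y.
\mathcal{L}_{\mathrm{CE}}(x) = -\log p_\theta(y \mid x)

% Information entropy: depends on the full output distribution, needs no label or teacher.
\mathcal{H}(x) = -\sum_{v \in \mathcal{V}} p_\theta(v \mid x)\, \log p_\theta(v \mid x)

% First-order Taylor estimate of the change in a criterion F when neuron h_k is removed.
I_k = \bigl|\, \mathcal{F}(h) - \mathcal{F}(h \mid h_k = 0) \,\bigr|
    \approx \left|\, h_k \, \frac{\partial \mathcal{F}}{\partial h_k} \,\right|
```

With $\mathcal{F} = \mathcal{L}_{\mathrm{CE}}$ the score only tracks how pruning changes the label probability; with $\mathcal{F} = \mathcal{H}$ every vocabulary entry contributes to the score.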

Paper Structure

This paper contains 28 sections, 4 equations, 4 figures, 14 tables, and 1 algorithm.

Figures (4)

  • Figure 1: One-hot label loss criterion vs. our proposed information entropy criterion. The cross entropy criterion (left) measures the model prediction using only the label-related prediction. Based on this criterion, Taylor-based pruning only minimizes the change in the label-related prediction after pruning. In contrast, our proposed information entropy (right) fully represents the holistic prediction, thus minimizing the change in the global prediction distribution for better performance preservation.
  • Figure 2: Overview of our high-fidelity pruning method. We focus on pruning the hidden neurons $h$ of the parameter-intensive MLP modules of the LLM Transformer. To this end, we apply the information entropy of the model prediction to evaluate the importance scores of the hidden neurons $h$ based on Taylor expansion. Then, we rank the hidden neurons $h$ by the obtained importance scores and prune the least important ones, thus reducing model size while minimizing performance degradation.
  • Figure 3: Sample1: You are able to depend on Local Bathroom Remodel Crew to deliver the very best expert services when it comes to Bathroom Remodeling in Williamstown, NJ. Our crew of experienced experts will provide the expert services that you require with the most innovative technologies around. We make sure
  • Figure 4: Sample2: At Machu Picchu, be part of up With all the travellers as part of your group who hiked the common Inca Trail. If skies are very clear, get pleasure from a spectacular views more than The traditional metropolis from your Sunlight Gate, right before going on a guided stroll round the ruins. Mobile Massage Can Save The Environment! A more pure, not to say
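
The pipeline described in the Figure 2 caption (score hidden neurons with a Taylor expansion of the output entropy, rank them, prune the lowest-scoring ones) can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single ReLU MLP whose output is treated directly as logits, NumPy with a hand-derived entropy gradient instead of an LLM with autograd, and a per-sample `|h_k * dH/dh_k|` score averaged over a random calibration batch. All function names (`entropy_and_grad`, `neuron_importance`, `prune_mlp`) are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy_and_grad(logits):
    """Entropy H of softmax(logits) and its gradient dH/dlogits."""
    p = softmax(logits)
    log_p = np.log(np.clip(p, 1e-12, None))
    entropy = -(p * log_p).sum(axis=-1)
    # Analytic gradient: dH/dz_j = -p_j * (log p_j + H).
    grad = -p * (log_p + entropy[..., None])
    return entropy, grad


def neuron_importance(x, w_in, w_out):
    """First-order Taylor score |h_k * dH/dh_k|, averaged over samples."""
    h = np.maximum(x @ w_in, 0.0)        # hidden activations, (n, d_hidden)
    _, grad_logits = entropy_and_grad(h @ w_out)
    grad_h = grad_logits @ w_out.T       # chain rule back to the hidden layer
    return np.abs(h * grad_h).mean(axis=0)


def prune_mlp(w_in, w_out, scores, keep_ratio):
    """Keep the highest-scoring hidden neurons and drop the rest."""
    n_keep = max(1, int(round(keep_ratio * scores.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return w_in[:, keep], w_out[keep, :], keep


# Toy setup (assumed sizes): 8 inputs, 16 hidden neurons, 10-way output.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))             # stand-in for a calibration batch
w_in = rng.normal(size=(8, 16))
w_out = rng.normal(size=(16, 10))
w_out[3] = 0.0                           # neuron 3 cannot affect the output

scores = neuron_importance(x, w_in, w_out)
w_in_p, w_out_p, keep = prune_mlp(w_in, w_out, scores, keep_ratio=0.5)
print(scores[3], w_in_p.shape, w_out_p.shape)
```

Neuron 3 has all-zero outgoing weights, so its entropy gradient and therefore its score are zero, and it is among the neurons removed. No labels and no teacher model enter the computation, which is the property the abstract emphasizes; in a real LLM the gradient would come from backpropagation through the remaining layers rather than from the closed-form expression used here.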