
Less is KEN: a Universal and Simple Non-Parametric Pruning Algorithm for Large Language Models

Michele Mastromattei, Fabio Massimo Zanzotto

TL;DR

This work introduces KEN, a universal, non-parametric pruning method for large language models based on Kernel Density Estimation (KDE). For each row of a fine-tuned weight matrix, KEN estimates the parameter distribution, keeps only the entries best supported by the KDE, and resets the rest to their pre-trained values, building compact subnetworks without architecture-specific constraints. Across seven transformer models and varied datasets, KEN achieves 25%–70% parameter reduction with equal or improved performance compared to unpruned baselines and other pruning/PEFT methods. Because only the optimized subnetwork needs to be stored, the approach offers practical storage and deployment benefits, including dynamic reconfiguration. A companion tool, KEN$_{viz}$, visualizes the composition of the pruned model for explainability.

Abstract

Neural network pruning has become increasingly crucial due to the complexity of these models and their widespread use in various fields. Existing pruning algorithms often suffer from limitations such as architecture specificity, excessive complexity, and reliance on demanding calculations, rendering them impractical for real-world applications. This paper introduces KEN: a straightforward, universal, and unstructured pruning algorithm based on Kernel Density Estimation (KDE). KEN aims to construct optimized transformers by selectively preserving the most significant parameters while restoring the others to their pre-training state. This strategy preserves model performance while enabling storage of only the optimized subnetwork, leading to substantial memory savings. Extensive evaluations across seven different LLMs demonstrate that KEN achieves performance equal to or better than that of the original unpruned versions, with a minimum parameter reduction of 25%. Furthermore, in-depth comparisons with established pruning and PEFT algorithms confirm KEN's effectiveness. We further introduce KEN$_{viz}$, an explainable tool that visualizes the optimized model composition achieved by KEN from different points of view.
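The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "most significant" means the $k$ entries with the highest estimated density in each row, uses a hand-written Gaussian KDE with a Scott's-rule bandwidth, and the names `row_kde` and `ken_like_prune` are hypothetical.

```python
import numpy as np


def row_kde(row):
    """Gaussian KDE of a 1-D row, evaluated at the row's own entries.

    Assumption: bandwidth follows Scott's rule, h = std * n**(-1/5).
    """
    n = row.size
    h = row.std(ddof=1) * n ** (-1.0 / 5.0)
    if not h > 0:
        raise ValueError("row must contain at least two distinct values")
    # Pairwise standardized distances between all entries of the row.
    z = (row[:, None] - row[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))


def ken_like_prune(tuned, pretrained, k):
    """Keep the k highest-density entries per row; reset the rest.

    Returns the pruned matrix and a boolean mask of kept entries.
    """
    tuned = np.asarray(tuned, dtype=float)
    pretrained = np.asarray(pretrained, dtype=float)
    if tuned.shape != pretrained.shape or tuned.ndim != 2:
        raise ValueError("expected two 2-D matrices of the same shape")
    if not 1 <= k <= tuned.shape[1]:
        raise ValueError("k must be between 1 and the row length")

    keep = np.zeros(tuned.shape, dtype=bool)
    for i, row in enumerate(tuned):
        density = row_kde(row)
        keep[i, np.argsort(density)[-k:]] = True  # k best-supported entries

    # Kept entries retain fine-tuned values; the others return to pre-trained.
    return np.where(keep, tuned, pretrained), keep


# Toy usage with random matrices standing in for real weights.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(4, 16))
tuned = pretrained + rng.normal(scale=0.1, size=(4, 16))
pruned, keep = ken_like_prune(tuned, pretrained, k=4)
```

In this sketch a smaller `k` resets more entries per row, which corresponds to a higher pruning rate.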

Paper Structure

This paper contains 22 sections, 7 equations, 8 figures, 7 tables, and 1 algorithm.

Figures (8)

  • Figure 1: How $k$ value influences the KDE calculation, driving the parameter selection
  • Figure 2: KEN workpath: From a fine-tuned model (1), for each of its fine-tuned matrices (2), the row distribution and the respective KDE (Kernel Density Estimator) are calculated. All values within the KDE are selected (3.a), while the remainder are restored to their pre-tuned value (3.b). The resulting optimized matrix (4) is then fed back into the model (5)
  • Figure 3: Comparing the impact of KEN parameter selection on the same fine-tuned matrix (a). Matrix (a) represents the in_proj matrix at layer 0 of a DeBERTa model trained on the AG_NEWS dataset. Unselected parameters are left blank
  • Figure 4: Comparison between KEN and LoRA. Labels for the LoRA marker indicate the dimension of the rank-decomposition matrix analyzed while, for KEN, the $k$ value used
  • Figure 5: Output of KEN$_{viz}$ of the key attention matrix at layer 12 of a BERT model trained on glue-sst2. Reset parameters 47.92%
  • ...and 3 more figures
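
The reset percentage quoted in the Figure 5 caption, and the idea of storing only the optimized subnetwork, can be illustrated with a small sketch. It is an assumption-laden illustration, not the paper's storage format: the helper names are hypothetical, and an entry counts as "reset" whenever it equals its pre-trained value exactly.

```python
import numpy as np


def reset_percentage(pruned, pretrained):
    """Percentage of entries that equal their pre-trained value."""
    return 100.0 * np.mean(pruned == pretrained)


def extract_subnetwork(pruned, pretrained):
    """Return flat indices and values of entries that differ from pre-trained."""
    idx = np.flatnonzero(pruned != pretrained)
    return idx, pruned.ravel()[idx]


def restore(pretrained, idx, values):
    """Rebuild the pruned matrix from pre-trained weights plus the subnetwork."""
    out = pretrained.copy()
    np.put(out, idx, values)  # writes into the flattened view of `out`
    return out


# Toy usage: two entries keep fine-tuned values, the other ten are reset.
pretrained = np.arange(12, dtype=float).reshape(3, 4)
pruned = pretrained.copy()
pruned[0, 1] = 100.0
pruned[2, 3] = -5.0

idx, values = extract_subnetwork(pruned, pretrained)
rebuilt = restore(pretrained, idx, values)
```

Only `idx` and `values` need to be saved alongside a shared copy of the pre-trained model, which is where the memory saving would come from under these assumptions.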