Effective Model Pruning

Yixuan Wang, Dan Guralnik, Saiedeh Akbari, Warren Dixon

TL;DR

Effective Model Pruning (EMP) develops a universal, context-agnostic pruning threshold that converts any score distribution into an adaptive budget $N_{\text{eff}} = \big\lfloor 1/\sum_i \omega_i^2 \big\rfloor$ with $\omega_i = |s_i|/\sum_j |s_j|$. By retaining the top $N_{\text{eff}}$ entries, EMP achieves sparse models with performance close to dense baselines across MLPs, CNNs, Transformers/LLMs, and KAN, without architecture-specific budgets or tuning. The approach is bolstered by a tight lower bound on the preserved mass $s_{\text{eff}}$ via simplex geometry and an upper bound on the loss change $\epsilon$ for magnitude-based pruning, plus an efficient $O(N \log N)$ algorithm with a tunable $\beta$ to meet hardware constraints. Empirically, EMP delivers robust pruning across FCs, CNNs, KAN, LLMs, and featurewise image pruning, achieving substantial sparsity with minimal performance degradation and often outperforming fixed-sparsity baselines when paired with magnitude criteria.
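
As a small worked example of the budget formula (the score vector is hypothetical, chosen only to keep the arithmetic easy), take $s = (4, -3, 2, 1)$, so that $\sum_j |s_j| = 10$:

$$
\omega = (0.4,\ 0.3,\ 0.2,\ 0.1), \qquad \sum_i \omega_i^2 = 0.16 + 0.09 + 0.04 + 0.01 = 0.30, \qquad N_{\text{eff}} = \big\lfloor 1/0.30 \big\rfloor = \lfloor 3.33\ldots \rfloor = 3 .
$$

Keeping the three highest-scoring entries then preserves a normalized mass of $0.4 + 0.3 + 0.2 = 0.9$. The two extremes behave as one would expect from the formula: uniform scores give $\omega_i = 1/N$ and hence $N_{\text{eff}} = N$ (nothing is pruned), while a single nonzero score gives $N_{\text{eff}} = 1$.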

Abstract

We introduce Effective Model Pruning (EMP), a context-agnostic, parameter-free rule addressing a fundamental question about pruning: how many entries to keep. EMP does not prescribe how to score the parameters or prune the models; instead, it supplies a universal adaptive threshold that can be applied to any pruning criterion (weight magnitude, attention score, KAN importance score, or even feature-level signals such as image pixels) and used on structural parts or weights of the models. Given any score vector $s$, EMP maps $s$ to a built-in effective number $N_{\text{eff}}$, which is inspired by the Inverse Simpson index of contributors. Retaining the $N_{\text{eff}}$ highest-scoring entries and zeroing the remainder yields sparse models with performance comparable to the original dense networks across MLPs, CNNs, Transformers/LLMs, and KAN in our experiments. By leveraging the geometry of the simplex, we derive a tight lower bound on the preserved mass $s_{\text{eff}}$ (the sum of retained scores) over the corresponding ordered probability simplex associated with the score vector $s$. We further verify the effectiveness of $N_{\text{eff}}$ by pruning the model with a scaled threshold $\beta N_{\text{eff}}$ across a variety of criteria and models. Experiments suggest that the default $\beta = 1$ yields a robust threshold for model pruning, while $\beta \neq 1$ still serves as an optional adjustment to meet specific sparsity requirements.
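
The rule described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names `effective_number` and `emp_mask` are hypothetical, entries are ranked by absolute score, the scaled budget is taken as $\lfloor \beta N_{\text{eff}} \rfloor$ clipped to the vector length, and a small tolerance guards the floor against floating-point round-off. All of these are assumptions not stated in the text.

```python
import numpy as np

def effective_number(scores):
    """Inverse-Simpson effective number: floor(1 / sum_i w_i^2), w_i = |s_i| / sum_j |s_j|."""
    abs_s = np.abs(np.asarray(scores, dtype=np.float64)).ravel()
    total = abs_s.sum()
    if total == 0.0:
        return 0  # assumption: an all-zero score vector keeps nothing
    w = abs_s / total
    # The 1e-9 tolerance is an assumption to avoid e.g. 9.999999 flooring to 9.
    return int(np.floor(1.0 / np.sum(w * w) + 1e-9))

def emp_mask(scores, beta=1.0):
    """Boolean mask keeping the floor(beta * N_eff) largest-|score| entries."""
    s = np.asarray(scores, dtype=np.float64)
    flat = np.abs(s).ravel()
    # Assumption: the scaled budget is floored and clipped to the number of entries.
    budget = min(int(np.floor(beta * effective_number(s))), flat.size)
    order = np.argsort(-flat, kind="stable")  # O(N log N) sort, largest first
    mask = np.zeros(flat.size, dtype=bool)
    mask[order[:budget]] = True
    return mask.reshape(s.shape)

# Usage: magnitude pruning of a small weight vector with the default beta = 1.
weights = np.array([1.0, -4.0, 2.0, 3.0])
pruned = weights * emp_mask(weights)
```

With the default `beta=1.0` no sparsity level has to be chosen by hand; passing another `beta` only rescales the budget, which matches the optional adjustment mentioned in the abstract.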

Paper Structure

This paper contains 26 sections, 3 theorems, 32 equations, 4 figures, 9 tables, 1 algorithm.

Key Result

Lemma 1

Given a well-trained neural network $f(\theta^*,x)$, let $\epsilon = |L(\theta^*)-L(\theta^k)|$ denote the loss difference between the dense network and its pruned version, and let $H$ denote the Hessian matrix of the loss function $L$ with respect to the parameters $\theta$. Then $\epsilon$ admits an upper bound expressed in terms of $\mathop{\mathrm{Tr}}\nolimits(H)$, the trace of the matrix $H$.

Figures (4)

  • Figure 1: Illustration of the $B_\nu$ balls ($\nu = 1,2,3,4$) and the simplex $\Delta$. Note that ball $B_4$ degenerates to the barycenter ${\zeta_{[4]}}$.
  • Figure 2: Lower and upper bounds associated with pruning. The left panel illustrates the tight lower bound of the effective mass $s_{\text{eff}}$ as a function of $N_{\text{eff}}$ for $N=1000$. The right panel depicts the normalized upper bound of the loss change, $\epsilon /(\Vert \theta^*\Vert_1^2\mathop{\mathrm{Tr}}\nolimits(H))$, showing its rapid decay as $\rho$ increases.
  • Figure 3: Test accuracy of EMP-pruned models across different values of $\beta$. We examine six discrete values, $\beta \in \{0.5,0.75,1,1.25,1.5,2\}$, to demonstrate that $N_{\text{eff}}$ is a robust pruning threshold across different models and methods, evaluated on the MNIST (solid) and Fashion-MNIST (dashed) datasets.
  • Figure 4: EMP magnitude pruning on an RGB image. Left: original image (Figure Credit: https://www.pexels.com/photo/scenic-view-of-goreme-in-cappadocia-turkey-34012268/). Middle: EMP global magnitude pruning applied independently to each RGB channel. Right: patchwise EMP magnitude pruning, with local EMP applied on non-overlapping $4\times 4$ patches. The global method attains PSNR $= 29.4$ dB and SSIM $= 0.912$ at sparsity $0.267$, while the patchwise method achieves higher fidelity (PSNR $= 38.3$ dB, SSIM $= 0.991$) at increased sparsity $0.323$.
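
The patchwise variant in Figure 4 can be sketched as follows for a single image channel. This is a hedged illustration only: the helper names (`n_eff`, `emp_prune`, `patchwise_emp`) are hypothetical, pixel intensities are used directly as magnitude scores, and the handling of ties, all-zero patches, and floating-point round-off are assumptions rather than details taken from the paper.

```python
import numpy as np

def n_eff(values):
    """floor(1 / sum_i w_i^2) with w_i = |v_i| / sum_j |v_j| (0 for an all-zero input)."""
    a = np.abs(np.asarray(values, dtype=np.float64)).ravel()
    total = a.sum()
    if total == 0.0:
        return 0  # assumption: an all-zero patch keeps nothing
    w = a / total
    return int(np.floor(1.0 / np.sum(w * w) + 1e-9))  # tolerance is an assumption

def emp_prune(values):
    """Zero all but the N_eff largest-magnitude entries of an array."""
    arr = np.asarray(values, dtype=np.float64)
    flat = arr.ravel()
    keep = np.argsort(-np.abs(flat), kind="stable")[: n_eff(flat)]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(arr.shape)

def patchwise_emp(channel, patch=4):
    """Apply EMP independently to each non-overlapping patch x patch block."""
    channel = np.asarray(channel, dtype=np.float64)
    h, w = channel.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            # Slices at the border are simply smaller when h or w is not a multiple of patch.
            out[i:i + patch, j:j + patch] = emp_prune(channel[i:i + patch, j:j + patch])
    return out

def sparsity(x):
    """Fraction of entries equal to zero."""
    return float(np.mean(np.asarray(x) == 0))

# Usage: one 4x4 patch with pixel values 0..15.
demo = np.arange(16, dtype=np.float64).reshape(4, 4)
demo_pruned = patchwise_emp(demo)
```

Under the same assumptions, the global variant in the middle panel corresponds to calling `emp_prune` once on each whole channel instead of on each patch.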

Theorems & Definitions (4)

  • Lemma 1
  • Proposition 1
  • Theorem 2
  • proof