The Neural Pruning Law Hypothesis

Eugen Barbulescu, Antonio Alexoaie, Lucian Busoniu

TL;DR

The Neural Pruning Law Hypothesis asks whether pruning can be characterized in a uniform way. It proposes Hyperflux, an $L_0$ pruning framework in which weight importance emerges from the interaction between each weight's flux and a global pressure that drives weights towards pruning. Within this framework the authors observe a density-flux power law, $\ln(s)=\ln(c)-\alpha_0\ln(\gamma)$, and argue that the same relation should hold for any effective saliency metric. Empirically, Hyperflux achieves competitive or superior sparsity-accuracy tradeoffs on CIFAR-10/100 and ImageNet-1K, and the power law is observed across the magnitude, gradient, and $L_0$ pruning families, suggesting a unifying property of neural pruning. The work lays a foundation for principled discovery of sparse subnetworks, with potential impact on deploying efficient models on resource-constrained devices, and it may inform future research in domains such as NLP and reinforcement learning.
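Because the relation above is linear in log-log space, its constants can be estimated with an ordinary least-squares line fit. The following is a minimal sketch, not the authors' procedure. It assumes that $s$ is the final density and $\gamma$ the pressure (minimum flux), which this summary suggests but does not define. The function name `fit_power_law` and the data points are hypothetical and purely illustrative.

```python
import math


def fit_power_law(gammas, densities):
    """Least-squares fit of ln(s) = ln(c) - alpha0 * ln(gamma).

    Returns (c, alpha0). Both inputs must contain positive values.
    """
    xs = [math.log(g) for g in gammas]
    ys = [math.log(s) for s in densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form slope and intercept of the best-fit line y = slope * x + intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The law has slope -alpha0 and intercept ln(c).
    return math.exp(intercept), -slope


# Synthetic points generated from s = c * gamma**(-alpha0); not measurements from the paper.
true_c, true_alpha0 = 0.5, 1.2
gammas = [1.0, 2.0, 4.0, 8.0, 16.0]
densities = [true_c * g ** (-true_alpha0) for g in gammas]

c_hat, alpha0_hat = fit_power_law(gammas, densities)
print(f"c = {c_hat:.3f}, alpha0 = {alpha0_hat:.3f}")
```

On real measurements the fit would only be meaningful inside the range where the points actually fall on a straight line in log-log coordinates, so the data should be restricted to that range before fitting.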

Abstract

Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most current pruning methods rely on ad-hoc heuristics that are poorly understood. We introduce Hyperflux, a conceptually-grounded pruning method, and use it to study the pruning process. Hyperflux models this process as an interaction between weight flux, the gradient's response to the weight's removal, and network pressure, a global regularization driving weights towards pruning. We postulate properties that arise naturally from our framework and find that the relationship between minimum flux among weights and density follows a power-law equation. Furthermore, we hypothesize the power-law relationship to hold for any effective saliency metric and call this idea the Neural Pruning Law Hypothesis. We validate our hypothesis on several families of pruning methods (magnitude, gradients, $L_0$), providing a potentially unifying property for neural pruning.

Paper Structure

This paper contains 29 sections, 21 equations, 16 figures, 3 tables, 4 algorithms.

Figures (16)

  • Figure 1: Scenarios for $\theta_i$ when $H(t_i)=0$. If $\mathcal{A}_i$ points towards $\omega_i$ the flux $\mathcal{G}^-_i$ regrows the weight as in (a) and (d). Otherwise, it keeps the weight pruned as in (b) and (c). Numerical values are only illustrative.
  • Figure 2: Convergence for fixed $\gamma = 2$.
  • Figure 3: The relationship between $\gamma$ (minimum flux) and final density for ResNet-50 on CIFAR-10. The 3 regions we discussed are highlighted in the figure.
  • Figure 4: The relationship between minimum saliency among weights and density for ResNet-50 on CIFAR-10, using Hyperflux (representing $L_0$ methods), iterative magnitude pruning, and Taylor approximation (representing gradient methods). The 3 regions, as well as the power-law relation, are clearly delimited in all 3 cases.
  • Figure 5: MNIST convergence for constant $\theta = 1$ for different learning rates.
  • ...and 11 more figures