Layer-adaptive sparsity for the Magnitude-based Pruning
Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, Jinwoo Shin
TL;DR
The paper addresses how to choose layerwise sparsity in magnitude-based pruning by introducing the layer-adaptive magnitude-based pruning (LAMP) score. The LAMP score divides each squared weight magnitude by the sum of squared magnitudes of the weights in the same layer that are at least as large, i.e., those still surviving when that weight is considered for pruning. Global pruning on this score then determines the layerwise sparsity implicitly, without hyperparameter tuning. The score is motivated by minimizing the model-level $\ell_2$ output distortion incurred by pruning. It is evaluated on several CNN architectures and image datasets, as well as language modeling tasks, where it consistently outperforms baseline layerwise sparsity schemes and remains robust in ablations. The resulting sparsity patterns also recover familiar heuristics (e.g., preserving early and late layers) and become comparatively uniform across layers at high sparsity levels, suggesting possible benefits for memory capacity and expressivity.
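For reference, the score described above can be written compactly. The notation here is an assumption of this summary: $W$ denotes the flattened weights of a single layer, with indices sorted so that $|W[u]| \le |W[v]|$ whenever $u < v$.

$$
\mathrm{score}(u; W) := \frac{(W[u])^2}{\sum_{v \ge u} (W[v])^2}
$$

Under this definition the largest-magnitude weight in every layer receives a score of exactly 1, and scores are unchanged if a whole layer is multiplied by a nonzero constant, which is what makes them comparable across layers.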
Abstract
Recent discoveries on neural network pruning reveal that, with a carefully chosen layerwise sparsity, simple magnitude-based pruning achieves a state-of-the-art tradeoff between sparsity and performance. However, without a clear consensus on "how to choose," the layerwise sparsities are mostly selected algorithm-by-algorithm, often resorting to handcrafted heuristics or an extensive hyperparameter search. To fill this gap, we propose a novel importance score for global pruning, which we call the layer-adaptive magnitude-based pruning (LAMP) score; the score is a rescaled version of weight magnitude that incorporates the model-level $\ell_2$ distortion incurred by pruning, and does not require any hyperparameter tuning or heavy computation. Under various image classification setups, LAMP consistently outperforms popular existing schemes for layerwise sparsity selection. Furthermore, we observe that LAMP continues to outperform baselines even in weight-rewinding setups, while the connectivity-oriented layerwise sparsity (the strongest baseline overall) performs worse than simple global magnitude-based pruning in this case. Code: https://github.com/jaeho-lee/layer-adaptive-sparsity
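To make the idea of global pruning on a rescaled magnitude concrete, here is a minimal sketch in plain Python. It is not the authors' released code: the function names `lamp_scores` and `global_lamp_masks` are hypothetical, each layer is assumed to be a flat list of floats rather than a tensor, and the score is assumed to be the squared weight divided by the sum of squares of all weights in the same layer with magnitude at least as large.

```python
from typing import Dict, List


def lamp_scores(weights: List[float]) -> List[float]:
    """Return a LAMP-style score for each weight of one flattened layer.

    Assumed definition: w_u^2 divided by the sum of w_v^2 over all weights
    in the layer whose magnitude is at least |w_u| (including w_u itself).
    """
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    tail = 0.0  # running sum of squares, accumulated from the largest weight down
    for i in reversed(order):
        sq = weights[i] ** 2
        tail += sq
        scores[i] = sq / tail if tail > 0.0 else 0.0
    return scores


def global_lamp_masks(
    layers: Dict[str, List[float]], sparsity: float
) -> Dict[str, List[bool]]:
    """Prune the lowest-scoring fraction of weights across all layers.

    Returns keep-masks (True = weight survives) keyed by layer name.
    """
    scored = [
        (score, name, i)
        for name, w in layers.items()
        for i, score in enumerate(lamp_scores(w))
    ]
    scored.sort(key=lambda t: t[0])  # stable sort on the score only
    n_prune = int(sparsity * len(scored))
    masks = {name: [True] * len(w) for name, w in layers.items()}
    for _, name, i in scored[:n_prune]:
        masks[name][i] = False
    return masks


# Two layers whose weights differ only by a factor of 100.
layers = {"a": [1.0, 2.0], "b": [100.0, 200.0]}
masks = global_lamp_masks(layers, sparsity=0.5)
```

In this toy input, plain global magnitude pruning at 50% sparsity would remove both weights of layer `a`, because they are the two smallest overall. The rescaled score ignores the per-layer scale, so one weight is removed from each layer instead, which illustrates how a layerwise sparsity can emerge from a single global threshold.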
