FAIR-Pruner: Leveraging Tolerance of Difference for Flexible Automatic Layer-Wise Neural Network Pruning

Chenqing Lin, Mostafa Hussien, Chengyao Yu, Bingyi Jing, Mohamed Cheriet, Osama Abdelrahman, Ruixing Ming

TL;DR

FAIR-Pruner addresses the challenge of pruning neural networks with non-uniform, layer-wise sparsity by introducing Tolerance of Differences (ToD), an indicator that balances a Wasserstein-based Utilization Score against a Taylor-based Reconstruction Score to determine per-layer pruning budgets. The method decouples threshold determination from importance estimation, enabling fast, flexible one-shot pruning controlled by a preset level $\alpha$, with per-layer counts $\hat{m}^{(l)}$ guided by ToD. Key components include the use of a Wasserstein distance to measure unit-level discriminative power, a first-order Taylor approximation of loss impact, and a pruning strategy based on quantile thresholds for both scores. Empirically, FAIR-Pruner achieves state-of-the-art or competitive accuracy at high compression across CIFAR-10, SVHN, and ImageNet on architectures such as VGG, AlexNet, ResNet, and DenseNet, while providing substantial speedups and low overhead. The approach also shows that ToD-based layer-wise pruning rates can enhance existing saliency metrics (e.g., L1), delivering better accuracy than uniform pruning, which makes it practical for efficient deployment on edge devices.
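
As a rough illustration of the first-order Taylor idea mentioned above, the loss change caused by zeroing the parameters of one unit can be approximated as below. This is a generic form of the criterion, not necessarily the exact Reconstruction Score defined in the paper; the symbols $\mathcal{L}$ (loss), $\theta_j$ (parameters of unit $j$), and $\theta_{\setminus j}$ (all other parameters) are notation assumed here for illustration.

$$
\mathcal{L}(\theta_{\setminus j}, 0) - \mathcal{L}(\theta_{\setminus j}, \theta_j) \approx -\nabla_{\theta_j}\mathcal{L}(\theta)^{\top}\theta_j
$$

Under this approximation, the magnitude $\left|\nabla_{\theta_j}\mathcal{L}(\theta)^{\top}\theta_j\right|$ serves as an estimate of how much the loss would move if unit $j$ were removed.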

Abstract

Neural network pruning has been widely adopted to reduce the parameter scale of complex neural networks, enabling efficient deployment on resource-limited edge devices. Mainstream pruning methods typically adopt uniform pruning strategies, which tend to cause substantial performance degradation at high sparsity levels. Recent studies focus on non-uniform layer-wise pruning, but such approaches typically depend on global architecture optimization, which is computationally expensive and lacks flexibility. To address these limitations, this paper proposes a novel method named Flexible Automatic Identification and Removal (FAIR)-Pruner, which adaptively determines the sparsity level of each layer and identifies the units to be pruned. The core of FAIR-Pruner is a novel indicator, Tolerance of Differences (ToD), designed to balance the importance scores obtained from two complementary perspectives: the architecture level (Utilization Score) and the task level (Reconstruction Score). By controlling ToD at preset levels, FAIR-Pruner determines layer-specific thresholds and removes units whose Utilization Scores fall below the corresponding thresholds. Furthermore, by decoupling threshold determination from importance estimation, FAIR-Pruner allows users to flexibly obtain pruned models under varying pruning ratios. Extensive experiments demonstrate that FAIR-Pruner achieves state-of-the-art performance, maintaining higher accuracy even at high compression ratios. Moreover, the ToD-based layer-wise pruning ratios can be applied directly to existing importance measurements, improving their performance relative to uniform pruning.
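
The decoupling described in the abstract can be sketched as follows: importance scores are computed once per layer, and any set of per-layer pruning counts can then be applied to them without recomputation. This is a minimal sketch, not the authors' implementation; the function name `select_units_to_prune`, the layer names, and the score and count values are all hypothetical, and the per-layer counts are taken as given rather than derived from ToD.

```python
def select_units_to_prune(scores_by_layer, counts_by_layer):
    """Return, for each layer, the sorted indices of its lowest-scoring units.

    scores_by_layer: dict mapping a layer name to a list of importance scores
                     (any saliency metric: a utilization-style score, L1 norm, ...).
    counts_by_layer: dict mapping a layer name to the number of units to remove
                     (assumed to be supplied by a separate budgeting step).
    """
    pruned = {}
    for layer, scores in scores_by_layer.items():
        # Never remove more units than the layer has.
        m = min(counts_by_layer.get(layer, 0), len(scores))
        # Rank unit indices from least to most important.
        order = sorted(range(len(scores)), key=lambda j: scores[j])
        pruned[layer] = sorted(order[:m])
    return pruned


# Hypothetical scores, computed once.
scores = {"conv1": [0.9, 0.1, 0.5, 0.3], "conv2": [0.2, 0.8, 0.7]}

# Two hypothetical non-uniform budgets applied to the same scores.
light_budget = {"conv1": 1, "conv2": 1}
heavy_budget = {"conv1": 3, "conv2": 1}

light = select_units_to_prune(scores, light_budget)
heavy = select_units_to_prune(scores, heavy_budget)
```

Because the scores are reused, moving from the light to the heavy budget only changes the counts, which mirrors the claim that pruned models at different ratios can be obtained without re-estimating importance.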

Paper Structure

This paper contains 29 sections, 1 theorem, 13 equations, 9 figures, 12 tables, 3 algorithms.

Key Result

Proposition 1

For any $l\in L$ and $j\in[J^{(l)}]$, let $\widehat{d}_j^{(l)}= \sup_{k_1\neq k_2 \in [K]} d(\widehat{O}^{(l)}_{j, n_{k_1}}, \widehat{O}^{(l)}_{j, n_{k_2}})$ be an estimator of $\mathcal{U}_{j}^{(l)}$. Suppose $K <\infty$ and $E|O_j^{(l)}(Z_{k})|<\infty$ for any $k\in[K]$. Then, as $n_{k}\to\infty$ for every $k\in[K]$, $\widehat{d}_j^{(l)} \xrightarrow{\text{a.s.}} \mathcal{U}_{j}^{(l)}$, where a.s. denotes almost sure convergence.
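
The estimator in Proposition 1 takes the largest pairwise distance between the empirical distributions of a unit's output under different classes. A minimal sketch follows, assuming that $d$ is the one-dimensional Wasserstein-1 distance and that each unit's output has been reduced to a scalar per sample; the function names and the toy data are hypothetical and not taken from the paper.

```python
from bisect import bisect_right
from itertools import combinations


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical CDFs; sample sizes may differ.
    """
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both CDFs are constant on [left, right).
        fx = bisect_right(xs, left) / len(xs)
        fy = bisect_right(ys, left) / len(ys)
        total += abs(fx - fy) * (right - left)
    return total


def utilization_estimate(outputs_by_class):
    """Largest pairwise distance between class-conditional output samples."""
    return max(
        wasserstein_1d(a, b)
        for a, b in combinations(outputs_by_class.values(), 2)
    )


# Toy outputs of a single unit for three classes (hypothetical values).
discriminative_unit = {0: [0.0, 0.1], 1: [0.0, 0.1], 2: [1.0, 1.1]}
flat_unit = {0: [0.5, 0.6], 1: [0.5, 0.6], 2: [0.5, 0.6]}

d_high = utilization_estimate(discriminative_unit)
d_low = utilization_estimate(flat_unit)
```

A unit whose output distribution shifts for at least one class receives a large estimate, while a unit that responds identically to every class receives zero. In practice a library routine such as `scipy.stats.wasserstein_distance` could replace the hand-written distance.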

Figures (9)

  • Figure 1: Pruning strategies comparison. Uniform pruning allocates fixed sparsity across layers, while architecture search optimizes layer widths but is computationally expensive. Our ToD control adaptively assigns layer-wise sparsity via a preset parameter $\alpha$ (defined in Eqs. \ref{eq:tau} and \ref{eq:hat_m}), enabling global compression control with negligible overhead.
  • Figure 2: Layer-wise pruning rates on VGG16 (Simonyan and Zisserman, 2014) (left) and AlexNet (Krizhevsky et al., 2012) (right).
  • Figure 3: Ablation study on the impact of the Utilization Score and ToD-based non-uniform layer-wise pruning rates on One-Shot Accuracy. "Random+ToD" refers to pruning channels randomly but with the ToD-based pruning rates. The shaded area represents the region within one standard deviation of the random experiment over 20 independent trials.
  • Figure 4: ToD level versus achieved pruning rate.
  • Figure 5: Empirical complexity of FAIR-Pruner. Measured pruning time on VGG16 across increasing sample sizes. Experiments conducted on an NVIDIA A100 GPU.
  • ...and 4 more figures

Theorems & Definitions (3)

  • Remark 1: Rationality of the U-Score
  • Proposition 1
  • Remark 2