How Sparse Can We Prune A Deep Network: A Fundamental Limit Perspective

Qiaozhe Zhang, Ruijie Zhang, Jun Sun, Yingzhuang Liu

TL;DR

The paper addresses the fundamental limit of pruning in deep networks. It formulates pruning as a sparsity-constrained loss feasibility problem over the loss sublevel set $S(\epsilon)$ and analyzes it with convex-geometry tools (Gaussian width, statistical dimension) and the Approximate Kinematics Formula. It derives computable lower and upper bounds on the pruning ratio, showing a sharp fundamental limit that depends on weight magnitude ($\|\mathbf{w}^*-\mathbf{w}^k\|_2$) and network sharpness (the trace of the Hessian, $\mathrm{Tr}(H)$). An $l_1$-regularization-based one-shot magnitude pruning (LOMP) scheme is proposed and paired with improved Hessian-spectrum estimation to approach the limit. Experiments on CIFAR and TinyImageNet with AlexNet, VGG, and ResNet architectures show close agreement between theory and practice. The results also offer rigorous interpretations of existing pruning heuristics (e.g., gradual pruning, the role of $l_2$ regularization) and provide practical guidance for achieving near-optimal pruning without significant accuracy loss.
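To make the magnitude-pruning step and the distance $\|\mathbf{w}^*-\mathbf{w}^k\|_2$ concrete, here is a minimal sketch in plain Python. It covers only the one-shot pruning step, not the $l_1$-regularized training that LOMP pairs it with. The names `magnitude_prune`, `projection_distance`, and `keep_ratio` are hypothetical and do not come from the paper's code, and treating `keep_ratio` as the fraction of weights retained is an assumption of this sketch.

```python
from typing import List


def magnitude_prune(weights: List[float], keep_ratio: float) -> List[float]:
    """Keep the k largest-magnitude weights and zero out the rest.

    keep_ratio is a hypothetical parameter: the fraction of weights retained,
    so k = round(keep_ratio * D) for a D-dimensional weight vector.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in [0, 1]")
    d = len(weights)
    k = round(keep_ratio * d)
    # Indices ordered by decreasing magnitude (stable, so ties keep input order).
    order = sorted(range(d), key=lambda i: -abs(weights[i]))
    keep = set(order[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def projection_distance(dense: List[float], sparse: List[float]) -> float:
    """Euclidean distance between the dense weights and their pruned version."""
    return sum((a - b) ** 2 for a, b in zip(dense, sparse)) ** 0.5


if __name__ == "__main__":
    w_star = [0.5, -0.01, 2.0, 0.03, -1.2, 0.0]
    w_k = magnitude_prune(w_star, keep_ratio=0.5)
    print(w_k)  # [0.5, 0.0, 2.0, 0.0, -1.2, 0.0]
    print(projection_distance(w_star, w_k))
```

Because only the smallest-magnitude entries are removed, the distance is the norm of the discarded weights, which is why small weight magnitudes translate into a small projection distance.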

Abstract

Network pruning is a commonly used measure to alleviate the storage and computational burden of deep neural networks. However, a characterization of the fundamental limit of network pruning is still lacking. To close the gap, in this work we take a first-principles approach: we directly impose the sparsity constraint on the loss function and leverage the framework of statistical dimension in convex geometry, which enables us to characterize the sharp phase transition point that can be regarded as the fundamental limit of the pruning ratio. Through this limit, we identify two key factors that determine the pruning ratio limit, namely weight magnitude and network sharpness. Generally speaking, the flatter the loss landscape or the smaller the weight magnitude, the smaller the pruning ratio. Moreover, we provide efficient countermeasures to address the challenges in computing the pruning limit, which mainly involve accurate spectrum estimation of a large-scale and non-positive Hessian matrix. In addition, through the lens of the pruning ratio threshold, we provide rigorous interpretations of several heuristics in existing pruning algorithms. Extensive experiments demonstrate that our theoretical pruning ratio threshold coincides very well with the empirical results. The code is available at: https://github.com/QiaozheZhang/Global-One-shot-Pruning
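Since sharpness enters through the Hessian, whose full matrix is far too large to form for a deep network, it may help to see how a trace can be estimated from matrix-vector products alone. The sketch below uses the classical Hutchinson estimator with Rademacher probes. This is a swapped-in textbook technique for illustration, not the spectrum-estimation procedure developed in the paper. The names `hutchinson_trace` and `dense_matvec` are hypothetical, and the small explicit matrix stands in for a Hessian-vector product that an autodiff framework would supply in practice.

```python
import random
from typing import Callable, List

Vector = List[float]


def hutchinson_trace(matvec: Callable[[Vector], Vector], dim: int,
                     num_probes: int = 200, seed: int = 0) -> float:
    """Estimate Tr(H) as the average of z^T (H z) over random sign vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = matvec(z)
        total += sum(zi * hi for zi, hi in zip(z, hz))
    return total / num_probes


def dense_matvec(matrix: List[Vector]) -> Callable[[Vector], Vector]:
    """Wrap an explicit matrix as a matvec; a stand-in for a real Hessian-vector product."""
    def apply(v: Vector) -> Vector:
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return apply


if __name__ == "__main__":
    # A small symmetric, indefinite matrix with exact trace 2 - 1 + 3 = 4.
    H = [[2.0, 1.0, 0.0],
         [1.0, -1.0, 0.5],
         [0.0, 0.5, 3.0]]
    print(hutchinson_trace(dense_matvec(H), dim=3, num_probes=2000))
```

The estimator is unbiased for any symmetric matrix, including indefinite ones, and it is exact for diagonal matrices because every probe entry squares to one.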

Paper Structure

This paper contains 55 sections, 19 theorems, 52 equations, 8 figures, 11 tables, 3 algorithms.

Key Result

Theorem 2.4

Let $\mathcal{C}$ be a convex conic hull of a sublevel set $S(\epsilon)$ in $\mathbb{R}^D$, and draw a random orthogonal basis ${\bf Q} \in \mathbb{R}^{D\times D}$. For a $k$-dimensional subspace $S_k$, it holds that:
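For orientation, the kinematic formula in the cited reference (Amelunxen et al.) has the two-sided form sketched below, specialized to a $k$-dimensional subspace, whose statistical dimension equals $k$. This is a reconstruction from that reference, not a quotation of the paper's Theorem 2.4, whose exact constants and notation may differ. Here $\delta(\mathcal{C})$ is the statistical dimension of $\mathcal{C}$, $\eta \in (0,1)$ is a tolerance, and $a_\eta$ is a constant depending only on $\eta$, assumed here to be $\sqrt{8\log(4/\eta)}$ as in Theorem I of that reference.

$$
\begin{aligned}
\delta(\mathcal{C}) + k \le D - a_\eta\sqrt{D} \;&\Longrightarrow\; \mathbb{P}\{\mathcal{C}\cap {\bf Q}S_k \neq \{{\bf 0}\}\} \le \eta,\\
\delta(\mathcal{C}) + k \ge D + a_\eta\sqrt{D} \;&\Longrightarrow\; \mathbb{P}\{\mathcal{C}\cap {\bf Q}S_k \neq \{{\bf 0}\}\} \ge 1-\eta.
\end{aligned}
$$

In words, a randomly oriented $k$-dimensional subspace intersects the cone nontrivially with high probability once $\delta(\mathcal{C}) + k$ exceeds $D$ by a margin of order $\sqrt{D}$, and misses it with high probability below that point, which is the source of the sharp phase transition.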

Figures (8)

  • Figure 1: Panel (a, b): Illustration of a convex conic hull and the statistical dimension. Panel (c): Effect of projection distance on projection size and intersection probability.
  • Figure 2: Effect of extremely small projection distance on projection size and intersection probability and statistics of ResNet50 on TinyImagenet. Statistics regarding all experiments can be found in the appendix.
  • Figure 3: Impact of sparsity on loss and test accuracy, both measured on the test dataset. Vertical lines mark the theoretical pruning ratio limit. The loss values have been normalized and translated.
  • Figure 4: Top Row: From left to right, the number of iterations increases, which raises the theoretical pruning ratio threshold. The horizontal line represents the last pruning ratio. Bottom Row: Comparison of the pruning ratio threshold with and without $l_2$-regularization. Sparse networks are obtained by magnitude-based pruning with fixed pruning ratios. The two plots on the left and the two plots on the right correspond to different fixed pruning ratios. Here, $R=\Vert {\bf w}^* - {\bf w}^k\Vert_2$ is the projection distance.
  • Figure 5: The theoretically predicted pruning ratio in eight tasks. The first row, from left to right, corresponds to FC5, FC12, AlexNet, and VGG16 on CIFAR10. The second row, from left to right, corresponds to ResNet18 and ResNet50 on CIFAR100, as well as ResNet18 and ResNet50 on TinyImagenet.
  • ...and 3 more figures

Theorems & Definitions (24)

  • Definition 2.1: Convex Cone & Conic Hull
  • Definition 2.2: Gaussian Width (Vershynin, 2014)
  • Definition 2.3: Statistical Dimension (Amelunxen et al., 2014)
  • Theorem 2.4: Approximate Kinematics Formula (Theorem 7.1 of Amelunxen et al., 2014)
  • Theorem 3.1: Gaussian Width vs. Statistical Dimension (Proposition 10.2 of Amelunxen et al., 2014)
  • Theorem 3.2: Lower Bound of Pruning Ratio
  • Lemma 3.3
  • Corollary 3.4
  • Lemma 3.5: Pruning Ratio vs. Sharpness
  • Theorem 3.6: Upper Bound of Pruning Ratio
  • ...and 14 more
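
As a reading aid for the entries above, the two central quantities and the relation between them can be written in their usual textbook form. This is a sketch under standard conventions and the paper's normalizations may differ; for example, some references take the supremum in the Gaussian width over $T - T$ rather than $T$. Here ${\bf g} \sim \mathcal{N}({\bf 0}, {\bf I}_D)$ is a standard Gaussian vector, $\Pi_{\mathcal{C}}$ is the Euclidean projection onto a closed convex cone $\mathcal{C}$, and $\mathbb{S}^{D-1}$ is the unit sphere in $\mathbb{R}^D$.

$$
w(T) = \mathbb{E}\sup_{{\bf x}\in T}\langle {\bf g}, {\bf x}\rangle,
\qquad
\delta(\mathcal{C}) = \mathbb{E}\,\Vert \Pi_{\mathcal{C}}({\bf g})\Vert_2^2,
\qquad
w(\mathcal{C}\cap\mathbb{S}^{D-1})^2 \le \delta(\mathcal{C}) \le w(\mathcal{C}\cap\mathbb{S}^{D-1})^2 + 1.
$$

The last relation means the squared Gaussian width of the cone's spherical section and the statistical dimension differ by at most one, so either can serve as the dimension-like quantity in the bounds.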