Max-Affine Spline Insights Into Deep Network Pruning

Haoran You, Randall Balestriero, Zhihan Lu, Yutong Kou, Huihong Shi, Shunyao Zhang, Shang Wu, Yingyan Celine Lin, Richard Baraniuk

TL;DR

This paper proposes to employ recent advances in the theoretical analysis of Continuous Piecewise Affine DNs to detect the early-bird (EB) ticket phenomenon, provide interpretability into current pruning techniques, and develop a principled pruning strategy.

Abstract

In this paper, we study the importance of pruning in Deep Networks (DNs) and the yin & yang relationship between (1) pruning highly overparametrized DNs that have been trained from random initialization and (2) training small DNs that have been "cleverly" initialized. Since in most cases practitioners can only resort to random initialization, there is a strong need to develop a grounded understanding of DN pruning. The current literature remains largely empirical and lacks a theoretical understanding of how pruning affects a DN's decision boundary, how to interpret pruning, and how to design corresponding principled pruning techniques. To tackle those questions, we propose to employ recent advances in the theoretical analysis of Continuous Piecewise Affine (CPA) DNs. From this perspective, we are able to detect the early-bird (EB) ticket phenomenon, provide interpretability into current pruning techniques, and develop a principled pruning strategy. In each step of our study, we conduct extensive experiments supporting our claims and results. While our main goal is to enhance the current understanding of DN pruning rather than to develop a new pruning method, our spline pruning criteria for layerwise and global pruning are on par with or even outperform state-of-the-art pruning methods.

Paper Structure

This paper contains 30 sections, 3 theorems, 8 equations, 15 figures, 12 tables, 1 algorithm.

Key Result

Proposition 1

Regardless of the type of pruning (weight/unit), setting entries of ${\bm{Q}}^{\ell}_W,{\bm{q}}_b^{\ell}$ to $0$, i.e., applying pruning, impacts both the per-region affine mappings ${\bm{A}}_{\omega},{\bm{b}}_{\omega}$ and the DN input space partition $\Omega$.
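
As a minimal sketch of what the proposition states, consider a toy one-hidden-layer ReLU network. Within a region $\omega$ of the input space partition the network acts as a single affine map, $z = {\bm{A}}_{\omega} x + {\bm{b}}_{\omega}$, and the region containing a point can be identified by which hidden units are active there. The weights, the probe point `x`, the helper `region_map`, and the reading of ${\bm{Q}}_W$ as an elementwise binary mask on the weight matrix are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def region_map(x, W1, b1, W2, b2):
    """Return the activation code at x and the affine map (A, b) of its region."""
    code = (W1 @ x + b1 > 0).astype(int)   # which hidden units are active at x
    D = np.diag(code)                      # diagonal 0/1 selection of active units
    A = W2 @ D @ W1                        # per-region slope A_omega
    b = W2 @ D @ b1 + b2                   # per-region offset b_omega
    return code, A, b

# Toy parameters (hypothetical, chosen by hand).
W1 = np.array([[1.0, -1.0], [1.0, 1.0], [-1.0, 2.0]])
b1 = np.array([0.2, -1.0, 0.5])
W2 = np.array([[1.0, -2.0, 1.0]])
b2 = np.array([0.0])

# Assumed binary mask: prune a single weight of the first hidden unit.
Q = np.ones_like(W1)
Q[0, 1] = 0.0

x = np.array([0.5, 1.0])
code_dense, A_dense, b_dense = region_map(x, W1, b1, W2, b2)
code_pruned, A_pruned, b_pruned = region_map(x, W1 * Q, b1, W2, b2)

print("dense :", code_dense, A_dense, b_dense)
print("pruned:", code_pruned, A_pruned, b_pruned)
```

In this toy case, zeroing one weight moves the first unit's boundary from $x_1 - x_2 + 0.2 = 0$ to the axis-aligned line $x_1 = -0.2$. The probe point therefore lands in a different region (its activation code changes), and both the slope and the offset of the affine map at that point change with it.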

Figures (15)

  • Figure 1: (a) Input space partitioning, showing how deeper layers successively subdivide the space in a toy DN with a 2-dimensional input space and three layers: $\mathcal{X}_0 \in \mathbb{R}^{2} \rightarrow \mathcal{X}_1 \in \mathbb{R}^{6} \rightarrow \mathcal{X}_2 \in \mathbb{R}^{6} \rightarrow \mathcal{X}_3 \in \mathbb{R}^{1}$, where the newly introduced boundaries are dark and previously built ones are grey. We see that: (i) the turning points of splines in later layers are located exactly on the splines of earlier layers, and (ii) the splines in the final classification layer are exactly the decision boundary (denoted by blue lines). Additional examples are supplied in the appendix; (b) Node (structured) pruning removes entire subdivision splines; (c) Weight (unstructured) pruning quantizes the partition splines to be colinear to the space axes. Both (b) and (c) are conceptual diagrams that explain how pruning reduces the expressiveness of the final decision boundary.
  • Figure 2: Classification task pruning using FCNets, where the blue lines represent subdivisions in the first layer and the red lines denote the last layer's decision boundary. We see that: (i) pruning indeed removes redundant subdivision lines, so that the decision boundary remains an $X$-shape until 80% of the nodes are pruned; and (ii) ideally, one blue subdivision line would be sufficient to provide two turning points for the decision boundary, e.g., the visualization at 80% sparsity. The middle figure visualizes the accuracy and training speeds (on an NVIDIA 2080Ti GPU) of the adopted FCNet under various pruning ratios. The general trend is that the more nodes we prune, the faster the training, at the cost of degraded accuracy.
  • Figure 3: Classification task pruning using ConvNets. To produce these visuals, we choose two images from different classes to obtain a $2$-dimensional slice of the $764$-dimensional input space (grid depicted on the left). We thus obtain a low-dimensional depiction of the subdivision splines, which we show in blue for the first layer, green for the second convolutional layer, and red for the decision boundary of $6$ vs. $9$ (based on the left grid). We consistently find that only a fraction of the splines are necessary to provide the turning points of the final decision boundary. The middle figure visualizes the accuracy and training speeds (on an NVIDIA 2080Ti GPU) of the adopted FCNet under various pruning ratios. The general trend is that the more nodes we prune, the faster the training, at the cost of degraded accuracy.
  • Figure 4: Visualization of spline trajectories, which mainly adapt during the early phase of training, demonstrating the lottery ticket hypothesis for DN partitions.
  • Figure 5: Visualization of the early-bird (EB) phenomenon, which can be leveraged to substantially reduce training costs by shortening the training of costly overparametrized DNs. Each sub-figure visualizes the quantitative distance over the whole training process. Both the $x$ and $y$ axes represent the epoch at which we draw the binary code representing the DN input space partition. Each point gives the distance between the binary codes drawn at the $x$-th and $y$-th epochs. The distances between consecutive epochs change rapidly in the first few training epochs (denoted by the dashed red box) and remain similar after that. We therefore draw Spline EB tickets at that epoch, which lies at the very beginning of the training process (i.e., 10 $\sim$ 20 epochs), indicating both the existence of EB tickets and the effectiveness of our detector.
  • ...and 10 more figures
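
The epoch-by-epoch distance map described in the Figure 5 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the binary code is taken to be the first-layer ReLU activation pattern at a set of probe inputs, the distance is taken to be a normalized Hamming distance, and the "checkpoints" are synthetic weights that drift toward a fixed end point. The paper's actual code construction, metric, and detector may differ, and the helper names `partition_code`, `code_distance`, and `distance_matrix` are hypothetical.

```python
import numpy as np

def partition_code(W, b, X):
    """Binary code: for each sample in X, which first-layer units are active."""
    return (X @ W.T + b > 0).astype(np.uint8)

def code_distance(code_a, code_b):
    """Normalized Hamming distance (an assumed metric) between two codes."""
    return float(np.mean(code_a != code_b))

def distance_matrix(checkpoints, X):
    """Pairwise distances between the codes drawn at every pair of epochs."""
    codes = [partition_code(W, b, X) for W, b in checkpoints]
    n = len(codes)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = code_distance(codes[i], codes[j])
    return D

# Hypothetical stand-in for saved checkpoints: weights that drift from a random
# start toward a random end point, with geometrically shrinking steps.
rng = np.random.default_rng(0)
W_start, W_end = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
b_fixed = np.zeros(6)
checkpoints = [(W_end + 0.5 ** t * (W_start - W_end), b_fixed) for t in range(8)]

X = rng.normal(size=(200, 2))   # probe inputs at which the code is drawn
D = distance_matrix(checkpoints, X)
print(np.round(D, 2))
```

Entry `D[i, j]` plays the role of the point at epochs $x = i$ and $y = j$ in the figure. A detector in this spirit would look for the epoch after which the distances between consecutive epochs stay small; any concrete threshold would have to come from the paper rather than from this sketch.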

Theorems & Definitions (4)

  • Proposition 1
  • Theorem 1
  • Proposition 2
  • Remark 1