A Signal Propagation Perspective for Pruning Neural Networks at Initialization

Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, Philip H. S. Torr

TL;DR

This work analyzes pruning neural networks at initialization through a signal propagation lens, formalizing initialization conditions that enable reliable connection sensitivity measurements via layerwise dynamical isometry. It shows that faithful gradient propagation, governed by the Jacobians, is essential for effective pruning and that pruning can disrupt dynamical isometry in sparse networks. To mitigate this, the authors propose a data-free method to recover approximate dynamical isometry (LDI-AI), improving trainability of pruned networks across architectures and datasets. They also demonstrate unsupervised pruning and neural architecture sculpting, revealing that architectures sculpted from oversized networks can outperform hand-designed baselines under the same parameter budget. Overall, the paper provides a principled framework linking initialization, pruning, and trainability, with practical implications for scalable, sparse neural networks and potential routes toward winning lottery ticket-like initializations.
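The data-free recovery step mentioned above can be illustrated with a small sketch. It assumes the objective is the squared Frobenius distance between the Gram matrix of the masked weights and the identity, i.e. $\|(C \odot W)^T (C \odot W) - I\|_F^2$, minimized over the retained weights only. The function names (`recover_isometry`, `isometry_loss`), the plain gradient descent loop, the step size, and the 3x3 toy matrix are all illustrative assumptions, not the authors' implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def apply_mask(w, mask):
    """Elementwise product C * W: pruned entries become exactly zero."""
    return [[wij * cij for wij, cij in zip(wr, cr)] for wr, cr in zip(w, mask)]


def gram_error(m):
    """E = M^T M - I, which is zero when the columns of M are orthonormal."""
    g = matmul(transpose(m), m)
    return [[g[i][j] - (1.0 if i == j else 0.0) for j in range(len(g))]
            for i in range(len(g))]


def isometry_loss(w, mask):
    """Squared Frobenius norm of (C*W)^T (C*W) - I."""
    return sum(e * e for row in gram_error(apply_mask(w, mask)) for e in row)


def recover_isometry(w, mask, lr=0.05, steps=200):
    """Gradient descent on the retained weights only (hypothetical helper).

    No data is needed: the objective depends only on the weights and the mask.
    """
    w = [row[:] for row in w]
    for _ in range(steps):
        m = apply_mask(w, mask)
        grad = matmul(m, gram_error(m))  # d(loss)/dM = 4 * M @ E
        for i in range(len(w)):
            for j in range(len(w[0])):
                # The mask zeroes the update for pruned connections.
                w[i][j] -= lr * 4.0 * grad[i][j] * mask[i][j]
    return w


# Toy example: an orthogonal matrix (a Householder reflection) loses
# orthogonality once a third of its entries are pruned.
w0 = [[1 / 3, -2 / 3, -2 / 3],
      [-2 / 3, 1 / 3, -2 / 3],
      [-2 / 3, -2 / 3, 1 / 3]]
mask = [[1, 1, 0],
        [0, 1, 1],
        [1, 0, 1]]

before = isometry_loss(w0, mask)
w1 = recover_isometry(w0, mask)
after = isometry_loss(w1, mask)
print(before, after)
```

Because the update is multiplied by the mask, the sparsity pattern chosen by pruning is left untouched; only the values of the surviving weights move.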

Abstract

Network pruning is a promising avenue for compressing deep neural networks. A typical approach to pruning starts by training a model and then removing redundant parameters while minimizing the impact on what is learned. Alternatively, a recent approach shows that pruning can be done at initialization prior to training, based on a saliency criterion called connection sensitivity. However, it remains unclear exactly why pruning an untrained, randomly initialized neural network is effective. In this work, by noting connection sensitivity as a form of gradient, we formally characterize initialization conditions to ensure reliable connection sensitivity measurements, which in turn yields effective pruning results. Moreover, we analyze the signal propagation properties of the resulting pruned networks and introduce a simple, data-free method to improve their trainability. Our modifications to the existing pruning at initialization method lead to improved results on all tested network models for image classification tasks. Furthermore, we empirically study the effect of supervision for pruning and demonstrate that our signal propagation perspective, combined with unsupervised pruning, can be useful in various scenarios where pruning is applied to non-standard arbitrarily-designed architectures.
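To make "connection sensitivity as a form of gradient" concrete, here is a minimal sketch. It assumes the usual definition: attach an auxiliary indicator $c_j$ to each weight, evaluate $\partial L/\partial c_j$ at $c = 1$, and normalize the absolute values so that they sum to one. The single linear unit, the squared loss, and the helper names (`connection_sensitivity`, `prune_mask`) are illustrative assumptions chosen to keep the gradient analytic; they are not taken from the paper.

```python
def connection_sensitivity(weights, inputs, targets):
    """Normalized |dL/dc_j| at c = 1 for y_hat = sum_j c_j * w_j * x_j.

    L is the mean of 0.5 * (y_hat - y)^2 over the batch, so
    dL/dc_j = mean((y_hat - y) * w_j * x_j).
    """
    n = len(inputs)
    grads = [0.0] * len(weights)
    for x, y in zip(inputs, targets):
        err = sum(w * xj for w, xj in zip(weights, x)) - y
        for j, (w, xj) in enumerate(zip(weights, x)):
            grads[j] += err * w * xj / n
    total = sum(abs(g) for g in grads)
    if total == 0.0:
        return [0.0] * len(weights)  # no gradient signal at all
    return [abs(g) / total for g in grads]


def prune_mask(scores, sparsity):
    """Keep the highest-scoring (1 - sparsity) fraction of connections."""
    keep = len(scores) - int(round(sparsity * len(scores)))
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep]
    mask = [0] * len(scores)
    for j in top:
        mask[j] = 1
    return mask


# Toy example with made-up numbers.
weights = [0.5, -1.0, 2.0, 0.1]
scores = connection_sensitivity(weights, inputs=[[1.0, 1.0, 1.0, 1.0]], targets=[0.0])
mask = prune_mask(scores, sparsity=0.5)
print(scores, mask)
```

Since the score is a gradient, it inherits whatever scaling the backward signal has at initialization, which is why the initialization conditions discussed in the abstract matter for the resulting mask.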

Paper Structure

This paper contains 17 sections, 1 theorem, 12 equations, 8 figures, 6 tables.

Key Result

Proposition 1

Let ${\mathbf{\epsilon}} = \partial L/\partial {\mathbf{x}}^K$ denote the error signal and ${\mathbf{x}}^0$ denote the input signal. Then,

Figures (8)

  • Figure 1: (left) layerwise sparsity patterns $c \in \{0,1\}^{100\times100}$ obtained as a result of pruning for the sparsity levels $\bar{\kappa}=\{10,\ldots,90\}\%$. Here, black ($0$) and white ($1$) pixels refer to pruned and retained parameters, respectively; (right) connection sensitivities (cs) measured for the parameters in each layer. All networks are initialized with $\gamma=1.0$. Unlike the linear case, the sparsity pattern for the tanh network is non-uniform over different layers. When pruning for a high sparsity level (e.g., $\bar{\kappa}=90\%$), this becomes critical and leads to poor learning capability, as only a few parameters are left in later layers. This is explained by the connection sensitivity plot, which shows that, for the nonlinear network, parameters in later layers have saturating, lower connection sensitivities than those in earlier layers.
  • Figure 2: (a) Signal propagation (mean Jacobian singular values) in sparse networks pruned for varying sparsity levels $\bar{\kappa}$, and (b) training behavior of the sparse network at $\bar{\kappa}=90\%$. Signal propagation, pruning scheme, and overparameterization affect the trainability of sparse neural networks. We train using SGD with an initial learning rate of 0.1, decayed by a factor of 10 every 20k iterations. All results are averaged over 10 runs. We provide other singular value statistics (max, min, std), the accuracy plot, and extended training results for random and magnitude pruning in Appendix \ref{sec:signal-propagation-training}.
  • Figure 3: Neural architecture sculpting results on CIFAR-10. We report generalization errors (averaged over $5$ runs). All networks have the same number of parameters (269k) and are trained identically.
  • Figure 4: Full results for (a) signal propagation (all singular value statistics), and (b) training behavior (including accuracy) for 7-layer linear and tanh MLP networks. We provide results of LDI-Rand, LDI-Rand-AI, VS-CS, LDI-CS, and LDI-CS-AI on the linear case for both singular value statistics and training logs. We also plot results of LDI-Mag and LDI-Dense on the tanh case for trainability; the training results for the non-pruned network (LDI-Dense) and magnitude pruning (LDI-Mag) are reported only for the tanh case, because the learning rate had to be lowered for the linear case (otherwise training diverges), which makes the comparison not entirely fair. We provide the singular value statistics for magnitude pruning in Figure \ref{fig:signal-propagation-sparse-more-mag} to avoid clutter. Extended training logs for random and magnitude-based pruning are provided separately in Figure \ref{fig:signal-propagation-sparse-extended} to illustrate the difference in convergence speed.
  • Figure 5: Extended training log (i.e., loss and accuracy) for random (Rand) and magnitude (Mag) pruning. The sparse networks obtained by random or magnitude pruning take much longer to train than those obtained by pruning based on connection sensitivity. All networks are pruned at the layerwise orthogonal initialization and trained the same way as before.
  • ...and 3 more figures
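
The non-uniform layerwise sparsity described in the Figure 1 caption follows from ranking all connections against one global budget: if later layers have systematically lower scores, they lose most of their parameters. The sketch below illustrates this with made-up, unnormalized scores; `global_prune`, `layer_density`, and the numbers in `toy_scores` are hypothetical and are not measurements from the paper.

```python
def global_prune(layer_scores, sparsity):
    """Keep the globally highest-scoring (1 - sparsity) fraction of parameters.

    Ties at the threshold are all kept, so slightly more than the target
    number of parameters may survive when scores are repeated.
    """
    flat = sorted((s for layer in layer_scores for s in layer), reverse=True)
    keep = len(flat) - int(round(sparsity * len(flat)))
    if keep <= 0:
        return [[0] * len(layer) for layer in layer_scores]
    threshold = flat[keep - 1]
    return [[1 if s >= threshold else 0 for s in layer] for layer in layer_scores]


def layer_density(masks):
    """Fraction of retained parameters in each layer."""
    return [sum(m) / len(m) for m in masks]


# Toy scores that shrink with depth, mimicking saturating sensitivities.
toy_scores = [
    [0.30, 0.25, 0.20, 0.15],    # early layer
    [0.12, 0.10, 0.04, 0.03],    # middle layer
    [0.05, 0.02, 0.01, 0.005],   # late layer
]
masks = global_prune(toy_scores, sparsity=0.5)
densities = layer_density(masks)
print(densities)  # [1.0, 0.5, 0.0]
```

At 50% overall sparsity the early layer is untouched while the late layer is removed entirely, which is the failure mode the caption attributes to high sparsity levels in the tanh network.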

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Definition 1
  • proof