Pruning Neural Networks at Initialization: Why are We Missing the Mark?

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin

TL;DR

This paper evaluates three methods for pruning at initialization (SNIP, GraSP, SynFlow), along with magnitude pruning at initialization, across multiple architectures to determine whether meaningful subnetworks can be found before training. These methods outperform random pruning but generally fall short of the accuracy of magnitude pruning after training, and their efficacy rests largely on per-layer pruning proportions rather than on weight-specific decisions. In ablations, accuracy is insensitive to shuffling the pruned weights within each layer or to reinitializing the weights, which suggests limits to these heuristics when they are applied at initialization. The study also finds that accuracy improves when pruning is applied later in training rather than at initialization, implying that signals available only later in training, or dynamic masking, may be needed for competitive early pruning. Overall, pruning at initialization currently trades accuracy for cost savings, and the results point future work toward new signals or strategies that could deliver those savings without compromising final accuracy.

Abstract

Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial values preserves or improves accuracy. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both.
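The two ablations mentioned in the abstract (shuffling the pruned weights within each layer, and sampling new initial values) can be illustrated with a minimal sketch. This is not the authors' code. The layer names, layer sizes, per-layer sparsities, and Gaussian initialization below are all assumptions chosen for illustration, and magnitude scoring is applied per layer only to keep the example short.

```python
import random


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask for one layer that prunes the smallest-magnitude weights."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def shuffle_within_layers(masks, seed=0):
    """Shuffle ablation: permute each layer's mask independently.

    The fraction of weights pruned in every layer is unchanged; only the
    choice of which individual weights are pruned is randomized.
    """
    rng = random.Random(seed)
    shuffled = {}
    for name, mask in masks.items():
        permuted = list(mask)
        rng.shuffle(permuted)
        shuffled[name] = permuted
    return shuffled


def reinitialize(layers, seed=1):
    """Reinitialization ablation: sample fresh weights and keep the masks as they are.

    A standard Gaussian is an assumption here; a real network would reuse
    its own initialization scheme.
    """
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in w] for name, w in layers.items()}


# Hypothetical two-layer "network" stored as flat weight lists.
init_rng = random.Random(42)
layers = {
    "conv1": [init_rng.gauss(0.0, 1.0) for _ in range(20)],
    "fc": [init_rng.gauss(0.0, 1.0) for _ in range(40)],
}
sparsities = {"conv1": 0.5, "fc": 0.8}  # assumed per-layer pruning fractions

masks = {name: magnitude_mask(w, sparsities[name]) for name, w in layers.items()}
shuffled_masks = shuffle_within_layers(masks)
new_layers = reinitialize(layers)

# Apply a mask by zeroing the pruned weights.
pruned_fc = [w * m for w, m in zip(new_layers["fc"], shuffled_masks["fc"])]
```

If training with `shuffled_masks` or `new_layers` matches the accuracy of the original masks and weights, the per-weight decisions carry no information beyond the per-layer sparsities, which is the property the abstract describes.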

Paper Structure

This paper contains 55 sections, 12 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Weights remaining at each training step for methods that reach accuracy within one percentage point of ResNet-50 on ImageNet. Dashed line is a result that is achieved retroactively.
  • Figure 2: Comparisons in the SNIP, GraSP, and SynFlow papers. Does not include MNIST. SNIP lacks baselines beyond MNIST. GraSP includes random, LTR, and other methods; it lacks magnitude at init and ablations. SynFlow has other methods at init but lacks baselines or ablations.
  • Figure 3: Accuracy of early pruning methods when pruning at initialization to various sparsities.
  • Figure 4: Ablations on subnetworks found by applying magnitude pruning, SNIP, GraSP, and SynFlow at initialization. (We ran limited ablations on ResNet-50 due to resource limitations.)
  • Figure 5: Percent of neurons (conv. channels) with sparsity $\geq s\%$ at the highest matching sparsity.
  • ...and 11 more figures