Pruning Neural Networks at Initialization: Why are We Missing the Mark?

Jonathan Frankle; Gintare Karolina Dziugaite; Daniel M. Roy; Michael Carbin

TL;DR

This paper evaluates methods for pruning at initialization (SNIP, GraSP, SynFlow, and magnitude pruning) across multiple architectures to determine whether meaningful subnetworks can be found before training. It shows that while these methods outperform random pruning, they generally do not reach the accuracy of magnitude pruning after training, and their efficacy rests largely on per-layer pruning proportions rather than on decisions about individual weights. Through extensive ablations, the authors show that accuracy is insensitive to shuffling the pruned weights within each layer or to reinitializing the weights, which suggests fundamental limits to these heuristics when they are applied at initialization. The study also finds that pruning after initialization and then continuing training improves performance more rapidly, implying that pruning signals tied to later training stages, or dynamic masking, may be necessary for competitive early pruning. Overall, pruning at initialization still trades accuracy for cost savings, and future work will need new signals or strategies to deliver those savings without compromising final accuracy.

Abstract

Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial values preserves or improves accuracy. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both.
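
The shuffling ablation described above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`magnitude_masks`, `shuffle_within_layers`), the toy two-layer weights, and the use of flat Python lists are all assumptions made for illustration. The sketch builds a global magnitude-pruning mask at initialization, then randomly permutes each layer's mask so that only the per-layer fraction of pruned weights is preserved.

```python
import random


def magnitude_masks(weights, sparsity):
    """Global magnitude pruning: prune the smallest-magnitude weights
    across all layers. `weights` maps a layer name to a flat list of
    floats (a hypothetical stand-in for real tensors). True means keep.
    Assumes no exact ties in magnitude at the threshold."""
    magnitudes = sorted(abs(w) for layer in weights.values() for w in layer)
    n_prune = int(round(sparsity * len(magnitudes)))
    if n_prune == 0:
        return {name: [True] * len(layer) for name, layer in weights.items()}
    threshold = magnitudes[n_prune - 1]  # largest magnitude that gets pruned
    return {
        name: [abs(w) > threshold for w in layer]
        for name, layer in weights.items()
    }


def shuffle_within_layers(masks, rng):
    """Randomly permute each layer's mask. The number of kept weights in
    each layer is unchanged, but which weights are kept becomes random."""
    shuffled = {}
    for name, mask in masks.items():
        permuted = list(mask)  # copy so the original mask is untouched
        rng.shuffle(permuted)
        shuffled[name] = permuted
    return shuffled


rng = random.Random(0)
# Toy "initialization": two layers with different weight scales.
weights = {
    "fc1": [rng.gauss(0.0, 1.0) for _ in range(200)],
    "fc2": [rng.gauss(0.0, 0.1) for _ in range(50)],
}
masks = magnitude_masks(weights, sparsity=0.5)
shuffled = shuffle_within_layers(masks, rng)

for name in masks:
    print(name, "kept:", sum(masks[name]), "kept after shuffle:", sum(shuffled[name]))
```

In the ablation, a network trained with the shuffled masks would be compared against one trained with the original masks. The abstract reports that, for the methods studied, accuracy is preserved or improves under this change, which is why the per-weight decisions can be replaced by a per-layer pruning fraction.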
