Growing Winning Subnetworks, Not Pruning Them: A Paradigm for Density Discovery in Sparse Neural Networks

Qihang Yao, Constantine Dovrolis

TL;DR

This paper reframes sparse neural network training by introducing PWMPR, a growth-based paradigm that automatically discovers operating density while training. Starting from a sparse seed, PWMPR grows connections guided by a Path Weight Magnitude Product (PWMP) score derived from path kernels, while randomization mitigates bottlenecks and a logistic-fit stopping rule halts growth at plateau. Empirical results on CIFAR, TinyImageNet, and ImageNet show PWMPR can approach IMP-derived lottery-ticket performance, but at higher densities and with substantially lower training cost (~1.5x dense vs. 3–4x for IMP-C). Overall, PWMPR demonstrates that constructive growth offers a viable, cost-efficient alternative to pruning and dynamic sparsity, opening the door to hybrid grow-prune methods and broadened applicability across architectures and domains.

Abstract

The lottery ticket hypothesis suggests that dense networks contain sparse subnetworks that can be trained in isolation to match full-model performance. Existing approaches (iterative pruning, dynamic sparse training, and pruning at initialization) either incur heavy retraining costs or assume the target density is fixed in advance. We introduce Path Weight Magnitude Product-biased Random growth (PWMPR), a constructive sparse-to-dense training paradigm that grows networks rather than pruning them, while automatically discovering their operating density. Starting from a sparse seed, PWMPR adds edges guided by path-kernel-inspired scores, mitigates bottlenecks via randomization, and stops when a logistic-fit rule detects plateauing accuracy. Experiments on CIFAR, TinyImageNet, and ImageNet show that PWMPR approaches the performance of IMP-derived lottery tickets, though at higher density, at substantially lower cost (~1.5x dense vs. 3–4x for IMP). These results establish growth-based density discovery as a promising paradigm that complements pruning and dynamic sparsity.
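
The growth step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain layered network stored as nested lists, and it assumes that the score of a missing edge is the product of the summed weight-magnitude products of all paths reaching its source unit (a forward pass) and all paths leaving its target unit (a backward pass). The function names `path_sums`, `pwmp_scores`, and `sample_growth`, and the small `floor` that keeps zero-score edges selectable, are hypothetical choices made for illustration only.

```python
import random


def path_sums(weights):
    """Forward/backward sums of |weight| products over a layered sparse net.

    weights[l][i][j] is the weight from unit i of layer l to unit j of
    layer l + 1; 0.0 marks an absent edge (an assumption of this sketch).
    """
    n_in = len(weights[0])
    n_out = len(weights[-1][0])
    # fwd[l][i]: total |weight| product of paths from any input to unit i.
    fwd = [[1.0] * n_in]
    for w in weights:
        prev = fwd[-1]
        fwd.append([sum(prev[i] * abs(w[i][j]) for i in range(len(w)))
                    for j in range(len(w[0]))])
    # bwd[l][i]: total |weight| product of paths from unit i to any output.
    bwd = [[1.0] * n_out]
    for w in reversed(weights):
        nxt = bwd[0]
        bwd.insert(0, [sum(abs(w[i][j]) * nxt[j] for j in range(len(w[0])))
                       for i in range(len(w))])
    return fwd, bwd


def pwmp_scores(weights):
    """Score every absent edge (l, i, j) by fwd[l][i] * bwd[l + 1][j]."""
    fwd, bwd = path_sums(weights)
    scores = {}
    for l, w in enumerate(weights):
        for i, row in enumerate(w):
            for j, value in enumerate(row):
                if value == 0.0:
                    scores[(l, i, j)] = fwd[l][i] * bwd[l + 1][j]
    return scores


def sample_growth(scores, k, floor=1e-3, seed=0):
    """Draw k distinct edges with probability proportional to score + floor."""
    rng = random.Random(seed)
    pool = dict(scores)
    chosen = []
    for _ in range(min(k, len(pool))):
        edges = list(pool)
        probs = [pool[e] + floor for e in edges]
        pick = rng.choices(edges, weights=probs, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Toy 2-2-2 network with one existing path and all present weights equal to 1.
weights = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
]
scores = pwmp_scores(weights)
new_edges = sample_growth(scores, k=2)
```

In this toy network, the candidate edges that would complete a new input-to-output path receive a score of 1, while edges touching disconnected units receive 0. Sampling rather than taking the top-scoring edges is one simple way to keep growth from concentrating on a few already well-connected units, which is the role the abstract attributes to randomization.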

Paper Structure

This paper contains 45 sections, 15 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Conceptual comparison of strategies for identifying sparse neural networks that match dense network performance. Blue: Iterative Magnitude Pruning (IMP) removes low-magnitude weights after repeated full training cycles. Orange: Dynamic Sparse Training (DST) starts from a random sparse topology and alternates pruning and regrowth at a fixed target density. Green: Our iterative growth method begins with a sparse network and progressively adds connections during training. The performance-density and cost-density sketches highlight the contrasting trade-offs between pruning and growth.
  • Figure 2: Illustration of the PWMP random growth algorithm, including computation of PWMP scores through forward/backward passes and sampling new connections based on the scores. This illustration uses a simplified network where all edge weights are equal to 1.
  • Figure 3: Effect of growth timing on CIFAR-10 with ResNet-32. We compare three schedules: 5 epochs, 10 epochs, and an adaptive early-stopping criterion at each sparsity level. (a) Number of batches before each growth decision. (b) Accuracy after rough training. (c) Accuracy after extensive training.
  • Figure 4: Performance-density relationship of PWMPR compared with other growth mechanisms (RG and GG) and the ablated version (PWMP). (a) CIFAR-10 / ResNet-32. (b) CIFAR-100 / ResNet-56. (c) TinyImageNet / ResNet-18. (d) TinyImageNet / ViT. All experiments use the same iterative training-and-growth framework, and thus share the same training budget.
  • Figure 5: Performance-density relationship of PWMPR compared with other sparsity strategies (IMP-C, RigL, and PHEW). (a) CIFAR-10 / ResNet-32. (b) CIFAR-100 / ResNet-56. (c) TinyImageNet / ResNet-18. (d) TinyImageNet / ViT. For IMP-C, we evaluated performance over densities from 100% down to 1%. For PWMPR, PHEW, and RigL, results were collected only up to approximately 50% density to limit computational cost. For PHEW and RigL, we scale the number of training epochs at each density to match the cumulative training cost of PWMPR.
  • ...and 6 more figures