Cyclic Sparse Training: Is it Enough?

Advait Gadhikar; Sree Harsha Nelaturu; Rebekka Burkholz

TL;DR

The paper reframes sparse-network success by arguing that repeated cyclic training primarily improves optimization rather than solely enabling better mask learning or pruning-induced regularization. It demonstrates that pruning-at-initialization (PaI) methods gain significantly from cyclic training, sometimes surpassing standard iterative pruning, but high sparsity requires a strong coupling between parameter initialization and the sparse mask. To address this, the authors introduce SCULPT-ing, which couples sparse cyclic training with a single magnitude-based pruning step to align the mask and learned parameters, achieving competitive performance with substantially reduced computation. Empirically, cyclic training boosts PaI masks across datasets, while SCULPT-ing bridges the gap to state-of-the-art iterative pruning at high sparsity and offers practical gains in memory and compute. The work highlights optimization dynamics as a core factor in sparse training and provides a scalable pathway toward competitive sparse networks from scratch.

Abstract

The success of iterative pruning methods in achieving state-of-the-art sparse networks has largely been attributed to improved mask identification and an implicit regularization induced by pruning. We challenge this hypothesis and instead posit that their repeated cyclic training schedules enable improved optimization. To verify this, we show that pruning at initialization is significantly boosted by repeated cyclic training, even outperforming standard iterative pruning methods. We conjecture that the dominant mechanism behind this gain is a better exploration of the loss landscape, which leads to a lower training loss. However, at high sparsity, repeated cyclic training alone is not enough for competitive performance. A strong coupling between learnt parameter initialization and mask seems to be required. Standard methods obtain this coupling via expensive pruning-training iterations, starting from a dense network. To achieve this with sparse training instead, we propose SCULPT-ing, i.e., repeated cyclic training of any sparse mask followed by a single pruning step to couple the parameters and the mask. SCULPT-ing matches the performance of state-of-the-art iterative pruning methods in the high sparsity regime at reduced computational cost.
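The procedure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy least-squares model in NumPy, a warmup-plus-cosine learning rate within each cycle, three training cycles, and one extra training cycle after pruning. The function names (`cycle_lr`, `train_cycle`, `magnitude_prune`, `sculpt`) and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def cycle_lr(step, steps_per_cycle, peak_lr, warmup=10):
    """Learning rate within one cycle (assumed shape: linear warmup, cosine decay)."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, steps_per_cycle - warmup)
    return 0.5 * peak_lr * (1.0 + np.cos(np.pi * progress))

def train_cycle(w, mask, X, y, steps=100, peak_lr=0.1):
    """One training cycle of masked gradient descent on a least-squares loss."""
    for step in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        # Only weights inside the mask are updated; the mask itself stays fixed.
        w = w - cycle_lr(step, steps, peak_lr) * grad * mask
    return w * mask

def magnitude_prune(w, mask, sparsity):
    """Keep the largest-magnitude surviving weights so that `sparsity` is reached."""
    n_keep = int(round(w.size * (1.0 - sparsity)))
    scores = np.abs(w * mask).ravel()
    keep = np.argsort(scores)[::-1][:n_keep]
    new_mask = np.zeros(w.size)
    new_mask[keep] = 1.0
    # Never revive a weight that the original mask had already removed.
    return new_mask.reshape(w.shape) * mask

def sculpt(w, mask, X, y, target_sparsity, n_cycles=3):
    """Repeated cyclic training of a sparse mask, then a single pruning step."""
    for _ in range(n_cycles):
        w = train_cycle(w, mask, X, y)   # each cycle restarts the schedule
    mask = magnitude_prune(w, mask, target_sparsity)
    w = train_cycle(w, mask, X, y)       # assumed final retraining cycle
    return w, mask
```

The contrast with iterative pruning is that training here never touches a dense network: every cycle runs on the sparse mask, and the mask changes exactly once.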

Paper Structure (16 sections, 17 figures, 3 tables)

Introduction
Background and related work
Repeated cyclic sparse training
Does the mask matter?
SCULPT-ing
Discussion
Acknowledgements
Appendix
Improved sign recovery by cyclic training.
Experimental Setup
Iterative Magnitude Pruning (IMP)
Weight Rewinding (WR)
Learning Rate Rewinding (LRR)
Training iterations for cyclic training and LRR.
ERK vs Balanced sparsity ratios
...and 1 more section

Figures (17)

Figure 1: (a) Illustration of iterative pruning (top), cyclic training (middle) and SCULPT-ing (bottom). (b) Iterative pruning improves generalization on CIFAR10 (left) and ImageNet (right). Shaded area denotes gain in performance for dense networks with cyclic training.
Figure 2: (a) Improved generalization by cyclic training of a sparse mask (with sparsity $67\%$ and $90\%$) and a dense network. (b) Cyclic training improves over training with a one-cycle learning rate schedule, both for a dense network and for a random sparse network with $90\%$ sparsity. (c) Maximum eigenvalues of the Hessian of the loss function for a dense network (solid lines) and a random sparse network with $90\%$ sparsity (dotted lines). Results are reported for CIFAR10.
Figure 3: Cyclic training of a random sparse mask on CIFAR10 with a ResNet20 at different sparsities. (a) Number of sign flips during training. (b) Linear mode connectivity of the test loss and (c) train loss after consecutive training cycles.
Figure 4: Cyclic training boosts performance of any sparse mask including a random one and even outperforms LRR at low sparsity. Shaded region highlights the gain in performance of a dense network by cyclic training for reference. Solid lines denote results with cyclic training and dotted lines show standard training for PaI methods.
Figure 5: Comparing cyclic training with different combinations of sparse masks and parameter initializations to iterative pruning methods LRR, WR and IMP.
...and 12 more figures
