Masks, Signs, And Learning Rate Rewinding
Advait Gadhikar, Rebekka Burkholz
TL;DR
This work examines why Learning Rate Rewinding (LRR) improves sparse network training, using experiments that separate the effect of mask identification from that of parameter optimization. The authors analyze gradient-flow dynamics in a minimal two-layer ReLU model and show that LRR can inherit advantageous weight signs from an overparameterized phase, which lets it identify the ground-truth mask more reliably than Iterative Magnitude Pruning (IMP). They prove that, for a single hidden neuron, LRR converges to the target in more cases than IMP under reasonable initializations, and that overparameterization (higher input dimension) enables the sign switches that make this possible. Empirically, the study supports these insights on CIFAR-10/100, Tiny ImageNet, and ImageNet with ResNet architectures: LRR outperforms IMP across masks (including random ones) and sparsities, especially when batch normalization (BN) rewinding and warmup are employed. The results suggest that preserving and propagating sign information through pruning iterations makes sparse training more robust, and could guide the design of practical algorithms that train sparse networks from scratch.
Abstract
Learning Rate Rewinding (LRR) has been established as a strong variant of Iterative Magnitude Pruning (IMP) to find lottery tickets in deep overparameterized neural networks. While both iterative pruning schemes couple structure and parameter learning, understanding how LRR excels in both aspects can bring us closer to the design of more flexible deep learning algorithms that can optimize diverse sets of sparse architectures. To this end, we conduct experiments that disentangle the effect of mask learning and parameter optimization and how both benefit from overparameterization. The ability of LRR to flip parameter signs early and stay robust to sign perturbations seems to make it more effective not only in mask identification but also in optimizing diverse sets of masks, including random ones. In support of this hypothesis, we prove in a simplified single hidden neuron setting that LRR succeeds in more cases than IMP, as it can escape initially problematic sign configurations.
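The difference between the two schemes after each pruning step can be illustrated with a minimal sketch. It assumes the basic form of IMP, which rewinds surviving weights to their initial values, and LRR, which keeps the trained values and only restarts the learning rate schedule. The function names, the flat weight list, and the numeric values are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
def magnitude_mask(weights, mask, prune_fraction):
    """Prune the smallest-magnitude fraction of the currently active weights."""
    active = [i for i, m in enumerate(mask) if m]
    n_prune = int(len(active) * prune_fraction)
    # Sort active indices by |weight| ascending and drop the first n_prune.
    to_prune = set(sorted(active, key=lambda i: abs(weights[i]))[:n_prune])
    return [m and i not in to_prune for i, m in enumerate(mask)]


def next_round_start(trained, init, mask, scheme):
    """Weights at the start of the next train-prune round."""
    if scheme == "imp":
        source = init     # assumed basic IMP: rewind survivors to initial values
    elif scheme == "lrr":
        source = trained  # LRR: keep trained values; only the LR schedule restarts
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [w if m else 0.0 for w, m in zip(source, mask)]


# Hypothetical values: the third weight flipped its sign during training.
init = [0.5, -0.2, -0.8, 0.1]
trained = [0.9, 0.05, 0.7, -0.4]

mask = magnitude_mask(trained, [True] * 4, prune_fraction=0.25)
imp_start = next_round_start(trained, init, mask, "imp")
lrr_start = next_round_start(trained, init, mask, "lrr")

print(mask)       # which weights survive pruning
print(imp_start)  # survivors carry their initial signs
print(lrr_start)  # survivors carry the signs learned while dense
```

In this toy case both schemes share the same mask, but the IMP restart returns the third weight to its initial negative sign, whereas the LRR restart keeps the positive sign acquired during the denser training phase. This is the sign inheritance that the abstract points to.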
