Picking Winning Tickets Before Training by Preserving Gradient Flow
Chaoqi Wang, Guodong Zhang, Roger Grosse
TL;DR
The paper tackles the high training cost of overparameterized networks by pruning them at initialization (foresight pruning) instead of after training. It introduces Gradient Signal Preservation (GraSP), a criterion that uses Hessian-gradient products, motivated in part by neural tangent kernel (NTK) analysis, to select which weights to keep so that gradient flow through the pruned network is preserved. In experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet with VGG and ResNet architectures, GraSP outperforms SNIP at extreme sparsity levels and prunes 80% of the weights of VGG-16 on ImageNet at initialization with only a 1.6% drop in top-1 accuracy. The results suggest that extremely sparse networks can be trained from scratch, saving resources at training time as well as test time, and they leave open how best to optimize such sparse networks.
Abstract
Overparameterization has been shown to benefit both the optimization and generalization of neural networks, but large networks are resource hungry at both training and test time. Network pruning can reduce test-time resource requirements, but is typically applied to trained networks and therefore cannot avoid the expensive training process. We aim to prune networks at initialization, thereby saving resources at training time as well. Specifically, we argue that efficient training requires preserving the gradient flow through the network. This leads to a simple but effective pruning criterion we term Gradient Signal Preservation (GraSP). We empirically investigate the effectiveness of the proposed method with extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet, using VGGNet and ResNet architectures. Our method can prune 80% of the weights of a VGG-16 network on ImageNet at initialization, with only a 1.6% drop in top-1 accuracy. Moreover, our method achieves significantly better performance than the baseline at extreme sparsity levels.
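The abstract names the criterion but does not state its formula. As a rough illustration only, the sketch below assumes the score takes the form S = -θ ⊙ Hg (weights times the Hessian-gradient product, computed at initialization) and that the highest-scoring fraction of weights is removed, which is our reading of the full paper rather than something stated here. The toy linear model, the finite-difference approximation of Hg, and all function names (`grad`, `hessian_grad_product`, `grasp_mask`) are hypothetical choices made for the example, not the authors' implementation.

```python
def grad(theta, xs, ys):
    """Gradient of the mean squared error of a linear model y ~ theta . x."""
    n = len(xs)
    g = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            g[j] += 2.0 * err * xi / n
    return g


def hessian_grad_product(theta, xs, ys, eps=1e-3):
    """Approximate H g with a finite difference of gradients along g.

    Assumption: a real implementation would use automatic differentiation;
    the finite difference keeps this sketch dependency-free.
    """
    g = grad(theta, xs, ys)
    shifted = [t + eps * gi for t, gi in zip(theta, g)]
    g_shifted = grad(shifted, xs, ys)
    return [(a - b) / eps for a, b in zip(g_shifted, g)]


def grasp_mask(theta, xs, ys, prune_ratio):
    """Return (mask, scores); mask[i] == 0 marks a pruned weight.

    Assumed scoring rule: score_i = -theta_i * (H g)_i, and the
    prune_ratio fraction of weights with the highest scores is removed.
    """
    hg = hessian_grad_product(theta, xs, ys)
    scores = [-t * h for t, h in zip(theta, hg)]
    n_remove = int(prune_ratio * len(theta))
    order = sorted(range(len(theta)), key=lambda i: scores[i], reverse=True)
    removed = set(order[:n_remove])
    mask = [0 if i in removed else 1 for i in range(len(theta))]
    return mask, scores


# Toy usage: two weights, two data points, prune half of the weights.
theta0 = [1.0, -2.0]
inputs = [[1.0, 0.0], [0.0, 1.0]]
targets = [0.0, 0.0]
mask, scores = grasp_mask(theta0, inputs, targets, prune_ratio=0.5)
print(mask, scores)
```

In this toy problem the Hessian is the identity, so Hg equals the gradient and the scores reduce to the negated squared weights; the weight with the smaller magnitude receives the higher score and is the one pruned.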
