Finding Stable Subnetworks at Initialization with Dataset Distillation
Luke McDermott, Rahul Parhi
TL;DR
This work addresses finding stable subnetworks at initialization by leveraging dataset distillation to create a compact synthetic training set. The authors introduce Distilled Pruning, which uses distilled data in the inner loop of iterative magnitude pruning to obtain stable subnetworks from unstable dense initializations, and show that these synthetic subnetworks can match the performance of traditional lottery tickets on CIFAR-10 with ResNet-18 using far fewer training points. They further demonstrate that combining distilled pruning with IMP yields lottery tickets at high sparsities, including notable gains on ImageNet subsets, and provide supporting analyses (linear mode connectivity, loss-landscape visualizations, and Hessian diagnostics) showing that distilled subnetworks exhibit greater stability and smoother loss surfaces than conventional IMP. The findings suggest that distilled data can guide pruning dynamics from initialization, enabling efficient, high-sparsity lottery tickets and informing future data-centric pruning strategies. Practical impact includes potential reductions in training data needs and computational costs for discovering and training sparse subnetworks.
Abstract
Recent works have shown that Dataset Distillation, the process of summarizing the training data, can be leveraged to accelerate the training of deep learning models. However, its impact on training dynamics, particularly in neural network pruning, remains largely unexplored. In our work, we use distilled data in the inner loop of iterative magnitude pruning to produce sparse, trainable subnetworks at initialization, more commonly known as lottery tickets. While using 150x fewer training points, our algorithm matches the performance of traditional lottery ticket rewinding with ResNet-18 on CIFAR-10. Previous work highlights that lottery tickets can be found when the dense initialization is stable to SGD noise (i.e., training across different orderings of the data converges to the same minimum). We extend this discovery, demonstrating that stable subnetworks can exist even within an unstable dense initialization. In our linear mode connectivity studies, we find that pruning with distilled data discards parameters that contribute to the sharpness of the loss landscape. Lastly, we show that by first generating a stable sparsity mask at initialization, we can find lottery tickets at significantly higher sparsities than traditional iterative magnitude pruning.
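
To make the phrase "distilled data in the inner loop of iterative magnitude pruning" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy linear least-squares model in place of ResNet-18, a random array standing in for a real distilled dataset, and rewinding to the original initialization after every pruning round. All names (`train`, `magnitude_prune`, `distilled_pruning`, `X_syn`, `y_syn`) and all hyperparameters are hypothetical.

```python
import numpy as np

def train(w, mask, X, y, steps=200, lr=0.05):
    """Gradient descent on a masked linear least-squares model (toy stand-in for a network)."""
    w = w * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask  # pruned weights are held at zero
    return w

def magnitude_prune(w, mask, frac):
    """Zero out the `frac` smallest-magnitude weights among those still active."""
    active = np.flatnonzero(mask)
    n_prune = int(frac * active.size)
    if n_prune == 0:
        return mask
    order = np.argsort(np.abs(w[active]))  # ascending by magnitude
    new_mask = mask.copy()
    new_mask[active[order[:n_prune]]] = 0.0
    return new_mask

def distilled_pruning(w_init, X_syn, y_syn, rounds=5, frac=0.2):
    """Iterative magnitude pruning whose inner training loop sees only synthetic data."""
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        # Inner loop: train the current subnetwork from the same initialization
        # (rewinding), using the small synthetic set only.
        w_trained = train(w_init, mask, X_syn, y_syn)
        mask = magnitude_prune(w_trained, mask, frac)
    return mask

rng = np.random.default_rng(0)
w_init = rng.normal(size=50)
X_syn = rng.normal(size=(10, 50))  # assumption: placeholder for a distilled dataset
y_syn = rng.normal(size=10)

mask = distilled_pruning(w_init, X_syn, y_syn)
sparsity = 1.0 - mask.mean()
print(f"kept {int(mask.sum())} of {mask.size} weights (sparsity {sparsity:.2f})")
```

The returned mask is defined relative to `w_init`, so the sparse subnetwork `w_init * mask` could then be trained on the full dataset, which is the sense in which the mask is found "at initialization".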
