Rigging the Lottery: Making All Tickets Winners

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen

TL;DR

This work tackles the inefficiency of training sparse neural networks by introducing RigL, a dynamic sparse-training method that updates network connectivity during optimization based on weight magnitudes and gradient information, keeping memory and compute proportional to the current density. RigL achieves state-of-the-art accuracy within fixed FLOP budgets across vision and NLP tasks, often outperforming dense-to-sparse and static-sparse baselines, and offers insights into why allowing the topology to change helps navigate the loss landscape. The authors provide extensive empirical evaluation on ImageNet, CIFAR-10, and WikiText-103, perform systematic ablations, and show that gradient-guided growth combined with controlled update schedules consistently improves performance, with ERK sparsity distributions frequently yielding the best results. The work also discusses practical implications for deploying very large sparse models and points toward future hardware that can better support sparse computation.

Abstract

Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50 and MobileNets on ImageNet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static. Code used in our work can be found at github.com/google-research/rigl.
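
The topology update described in the abstract (drop connections by parameter magnitude, grow connections by gradient magnitude) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `rigl_update`, the flat-list layer representation, the `drop_fraction` parameter, the zero initialization of newly grown weights, and the restriction of growth candidates to connections that were inactive before the update are all assumptions made for illustration. In practice the dense gradient would be computed only at infrequent update steps.

```python
from typing import List, Tuple


def rigl_update(
    weights: List[float],
    mask: List[bool],
    dense_grad: List[float],
    drop_fraction: float,
) -> Tuple[List[float], List[bool]]:
    """One illustrative drop/grow step for a single flattened layer.

    The number of active connections is the same before and after the
    update, so the parameter count stays fixed.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]

    # Number of connections to swap (hypothetical rule: a fixed fraction
    # of the active set, capped by the number of available empty slots).
    k = min(int(drop_fraction * len(active)), len(inactive))

    # Drop: the k active connections with the smallest weight magnitude.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]

    # Grow: the k previously inactive connections with the largest
    # gradient magnitude (assumption: just-dropped ones are not candidates).
    grow = sorted(inactive, key=lambda i: abs(dense_grad[i]), reverse=True)[:k]

    new_weights = list(weights)
    new_mask = list(mask)
    for i in drop:
        new_mask[i] = False
        new_weights[i] = 0.0
    for i in grow:
        new_mask[i] = True
        new_weights[i] = 0.0  # assumption: grown weights start at zero
    return new_weights, new_mask


# Toy layer with 8 possible connections, 4 of them active.
weights = [0.5, 0.0, -0.1, 0.0, 0.0, 2.0, 0.0, 0.3]
mask = [True, False, True, False, False, True, False, True]
grad = [0.1, 0.9, 0.2, 0.05, 0.01, 0.3, 0.7, 0.4]

new_weights, new_mask = rigl_update(weights, mask, grad, drop_fraction=0.5)
print(new_mask)
print(new_weights)
```

In this toy example the two smallest active weights (indices 2 and 7) are dropped and the two inactive positions with the largest gradient magnitude (indices 1 and 6) are grown, so the layer still has four active connections.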

Paper Structure

This paper contains 25 sections, 2 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: RigL improves the optimization of sparse neural networks by leveraging weight magnitude and gradient information to jointly optimize model parameters and connectivity.
  • Figure 2: (left) Performance and cost of training 80% and 90% sparse ResNet-50s on the ImageNet-2012 classification task. We report the FLOPs needed for training and test (inference on a single sample) and normalize them by the FLOPs of a dense model. To make a fair comparison, we assume that the pruning algorithm utilizes sparsity during training (see the appendix for details on how FLOPs are calculated). Methods with superscript '*' indicate results reported in the corresponding papers (except the DNW results, which are obtained from kusupati2020str). Pruning results are obtained from gale2019state. (top-right) Performance of sparse training methods when training an 80% sparse ResNet-50 with a uniform sparsity distribution. Points on each curve correspond to individual training runs with training multipliers from 1 to 5 (except pruning, which is scaled between 0.5 and 2). The number of FLOPs required to train a standard dense ResNet-50, along with its performance, is indicated with a dashed red line. (bottom-right) Performance of RigL at different sparsity levels with extended training.
  • Figure 3: (left) RigL significantly improves the performance of sparse MobileNets (v1 and v2) on the ImageNet-2012 dataset and exceeds the pruning results reported by gupta2018. The performance of the dense MobileNets is indicated with red lines. (right) Performance of sparse MobileNet-v1 architectures plotted against their inference FLOPs. Networks with the ERK distribution achieve better performance with the same number of parameters but take more FLOPs to run. Training wider sparse models with RigL (Big-Sparse) yields a significant performance improvement over the dense model.
  • Figure 4: (left) Final validation loss of various sparse training methods on a character-level language modeling task. Cross-entropy loss is converted to bits (from nats). (right) Test accuracies of sparse WideResNet-22-2s on the CIFAR-10 task.
  • Figure 5: Effect of (left) sparsity distribution and (right) update schedule ($\Delta T$, $\alpha$) on the final performance.
  • ...and 7 more figures