Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks

Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, Yingyan Celine Lin

TL;DR

This work tackles the high training cost of deep networks by proposing Early-Bird (EB) tickets—subnetworks that can be identified early in training under low-cost schemes and still achieve final accuracies comparable to full training. It introduces a mask-distance metric to detect the emergence of EB tickets without full training, and a practical EB Train framework that identifies EB tickets early and retrains only those subnetworks. Empirical results show substantial savings: 2.2–2.4x reductions in training FLOPs and up to 24.6x energy savings (with FP8 search/retrain), while maintaining or improving accuracy across CIFAR-10/100 and ImageNet-scale models. The approach offers a readily adoptable path to cost-efficient training and highlights the value of early connectivity patterns in guiding efficient learning.

Abstract

Frankle & Carbin (2019) show that there exist winning tickets (small but critical subnetworks) for dense, randomly initialized networks that can be trained alone to achieve accuracies comparable to those of the dense networks in a similar number of iterations. However, identifying these winning tickets still requires the costly train-prune-retrain process, limiting their practical benefits. In this paper, we discover for the first time that winning tickets can be identified at a very early training stage, which we term early-bird (EB) tickets, via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates. Our finding of EB tickets is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early. Furthermore, we propose a mask distance metric that can be used to identify EB tickets with low computational overhead, without needing to know the true winning tickets that emerge after full training. Finally, we leverage the existence of EB tickets and the proposed mask distance to develop efficient training methods, which first identify EB tickets via low-cost schemes and then continue to train only the EB tickets towards the target accuracy. Experiments on various deep networks and datasets validate: 1) the existence of EB tickets and the effectiveness of mask distance in efficiently identifying them; and 2) that the proposed efficient training via EB tickets can achieve up to 4.7x energy savings while maintaining comparable or even better accuracy, demonstrating a promising and easily adopted method for tackling cost-prohibitive deep network training. Code is available at https://github.com/RICE-EIC/Early-Bird-Tickets.
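
To make the idea of a mask distance concrete, the sketch below shows one way such a check could look. The abstract does not define the metric, so everything here is an assumption for illustration: the distance is taken to be the normalized Hamming distance between binary channel-pruning masks, masks are derived by magnitude-ranking per-channel importance scores, and the names `channel_mask`, `mask_distance`, and `find_early_bird`, along with the `threshold` and `window` values, are hypothetical and not taken from the authors' code.

```python
from collections import deque


def channel_mask(scores, prune_ratio):
    """Binary mask that prunes the smallest-magnitude fraction of channels.

    `scores` is a list of per-channel importance values (assumed input).
    """
    n_prune = int(len(scores) * prune_ratio)
    # Channel indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)


def find_early_bird(score_history, prune_ratio, threshold=0.1, window=5):
    """Return the first epoch whose mask is within `threshold` of each of the
    previous `window` masks, or None if the mask never stabilizes.

    `threshold` and `window` are illustrative defaults, not tuned values.
    """
    recent = deque(maxlen=window)  # FIFO buffer of the most recent masks
    for epoch, scores in enumerate(score_history):
        mask = channel_mask(scores, prune_ratio)
        stable = len(recent) == window and all(
            mask_distance(mask, old) < threshold for old in recent
        )
        if stable:
            return epoch
        recent.append(mask)
    return None
```

In this sketch, `score_history` holds one list of channel scores per training epoch. Once the mask stops changing between epochs, the search stage would end and only the masked subnetwork would be trained further, which is where the claimed savings would come from.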

Paper Structure

This paper contains 18 sections, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Retraining accuracy vs. epoch numbers at which the subnetworks are drawn, for both PreResNet101 and VGG16 on the CIFAR-10/100 datasets, where $p$ indicates the channel pruning ratio and the dashed line shows the accuracy of the corresponding dense model on the same dataset, $\medwhitestar$ denotes the retraining accuracies of subnetworks drawn from the epochs with the best search accuracies, and error bars show the minimum and maximum of three runs.
  • Figure 2: Retraining accuracy and total training FLOPs vs. the epoch at which the subnetwork is drawn, when using 8-bit precision during the stage of identifying EB tickets, for the VGG16 model on the CIFAR-10/100 datasets, where $p$ indicates the channel-wise pruning ratio and the dashed line shows the accuracy of the corresponding dense model on the same dataset.
  • Figure 3: Visualization of the pairwise mask distance matrix for VGG16 and PreResNet101 on CIFAR-100.
  • Figure 4: A high-level overview of the commonly adopted progressive pruning and training scheme and our EB Train.
  • Figure 5: The total training FLOPs vs. the epochs at which the subnetworks are drawn, for both the PreResNet101 and VGG16 models on the CIFAR-10 and CIFAR-100 datasets, where $p$ indicates the channel-wise pruning ratio for extracting the subnetworks. Note that the EB tickets in all cases achieve comparable or higher accuracies and consume fewer FLOPs than the "ground-truth" winning tickets (drawn after the full training of 160 epochs).
  • ...and 1 more figure