Early Transformers: A study on Efficient Training of Transformer Models through Early-Bird Lottery Tickets
Shravan Cheekati
TL;DR
The paper tackles the high computational cost of Transformer training by testing whether early-bird subnetworks exist in both vision and language models. It introduces a method that combines iterative pruning, a masked distance metric, and selective retraining, applying thresholds $p_{vision}=0.1$ and $p_{text}=0.01$ to identify early-bird tickets in ViT, Swin-T, GPT-2, and RoBERTa. Empirically, early-bird tickets emerge within the first few epochs, and the pruned models match or exceed the accuracy of their unpruned counterparts while substantially reducing memory usage; the achievable pruning ratio varies by model. The findings indicate that the early-bird ticket phenomenon generalizes across Transformer architectures and tasks, offering a practical path to more efficient and accessible training of large-scale models.
Abstract
Transformer models have revolutionized natural language processing and computer vision, but training them remains resource-intensive and time-consuming. This paper investigates whether the early-bird ticket hypothesis can be applied to improve the training efficiency of Transformer models. We propose a methodology that combines iterative pruning, masked distance calculation, and selective retraining to identify early-bird tickets in several Transformer architectures: ViT, Swin-T, GPT-2, and RoBERTa. Our experimental results show that early-bird tickets can be consistently found within the first few epochs of training or fine-tuning, enabling significant resource savings without compromising performance. The pruned models obtained from early-bird tickets achieve accuracy comparable or even superior to that of their unpruned counterparts while substantially reducing memory usage. Furthermore, our comparative analysis highlights the generalizability of the early-bird ticket phenomenon across different Transformer models and tasks. This research contributes to the development of efficient training strategies that make Transformer models more accessible and less resource-hungry. By leveraging early-bird tickets, practitioners can reduce the computational burden of training Transformer models for natural language processing and computer vision applications.
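To make the masked distance calculation concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that a pruning mask is obtained by global magnitude pruning, that the masked distance is the normalized Hamming distance between the masks of consecutive epochs, and that the thresholds quoted in the TL;DR (0.1 for vision, 0.01 for text) act as cutoffs on that distance. The function names `magnitude_mask`, `mask_distance`, and `find_early_bird` are hypothetical, and weights are flattened to plain Python lists for brevity.

```python
from typing import List, Optional, Sequence


def magnitude_mask(weights: Sequence[float], prune_ratio: float) -> List[int]:
    """Return a 0/1 mask that zeroes the smallest-magnitude fraction of weights."""
    n_prune = int(len(weights) * prune_ratio)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def mask_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalized Hamming distance: fraction of positions where the masks differ."""
    if len(a) != len(b):
        raise ValueError("masks must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def find_early_bird(
    snapshots: Sequence[Sequence[float]], prune_ratio: float, threshold: float
) -> Optional[int]:
    """Return the first epoch whose mask is within `threshold` of the previous
    epoch's mask, or None if the masks never stabilize.

    Assumption: only consecutive epochs are compared; a more conservative
    variant would require the distance to stay small over several epochs.
    """
    prev_mask = None
    for epoch, weights in enumerate(snapshots):
        mask = magnitude_mask(weights, prune_ratio)
        if prev_mask is not None and mask_distance(prev_mask, mask) < threshold:
            return epoch
        prev_mask = mask
    return None


# Toy weight snapshots for three epochs (illustrative values, not real data).
snapshots = [
    [0.9, -0.1, 0.2, 0.05],
    [0.8, -0.3, 0.1, 0.05],
    [0.8, -0.4, 0.1, 0.02],
]
print(find_early_bird(snapshots, prune_ratio=0.5, threshold=0.1))  # prints 2
```

In this toy run the mask changes between epochs 0 and 1 (distance 0.5) and is unchanged between epochs 1 and 2 (distance 0.0), so epoch 2 is reported as the early-bird point. In a real setting the mask at that epoch would be applied to the model, and only the surviving weights would be retrained.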
