Early Transformers: A study on Efficient Training of Transformer Models through Early-Bird Lottery Tickets

Shravan Cheekati

TL;DR

The paper tackles the high computational cost of Transformer training by testing the existence of early-bird subnetworks across both vision and language models. It introduces a method that combines iterative pruning, a masked distance metric, and selective retraining, applying thresholds $p_{vision}=0.1$ and $p_{text}=0.01$ to identify early-bird tickets in ViT, Swin-T, GPT-2, and RoBERTa. Empirically, early-bird tickets emerge within the initial epochs and yield comparable or superior accuracy while substantially reducing memory usage, with model-specific pruning ratios. The findings demonstrate the generalizability of the early-bird ticket phenomenon across Transformer architectures and tasks, offering a practical pathway to more efficient and accessible training of large-scale models.

Abstract

The training of Transformer models has revolutionized natural language processing and computer vision, but it remains a resource-intensive and time-consuming process. This paper investigates the applicability of the early-bird ticket hypothesis to optimize the training efficiency of Transformer models. We propose a methodology that combines iterative pruning, masked distance calculation, and selective retraining to identify early-bird tickets in various Transformer architectures, including ViT, Swin-T, GPT-2, and RoBERTa. Our experimental results demonstrate that early-bird tickets can be consistently found within the first few epochs of training or fine-tuning, enabling significant resource optimization without compromising performance. The pruned models obtained from early-bird tickets achieve comparable or even superior accuracy to their unpruned counterparts while substantially reducing memory usage. Furthermore, our comparative analysis highlights the generalizability of the early-bird ticket phenomenon across different Transformer models and tasks. This research contributes to the development of efficient training strategies for Transformer models, making them more accessible and resource-friendly. By leveraging early-bird tickets, practitioners can accelerate the progress of natural language processing and computer vision applications while reducing the computational burden associated with training Transformer models.
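The abstract names a masked distance calculation between pruning masks but does not define it here. As a rough illustration only, the sketch below assumes magnitude-based pruning, a normalized Hamming distance between the masks of consecutive epochs, and a threshold used as a cutoff on that distance. The function names (`magnitude_mask`, `mask_distance`, `find_early_bird`) and the toy weights are hypothetical, and the paper's actual procedure may differ.

```python
def magnitude_mask(weights, prune_ratio):
    """Return a 0/1 mask that zeroes the smallest-magnitude weights.

    Assumption: unstructured magnitude pruning over a flat list of weights.
    """
    n_prune = int(len(weights) * prune_ratio)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance: fraction of positions where masks differ."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)


def find_early_bird(weights_per_epoch, prune_ratio, threshold):
    """Return the first epoch whose mask is within `threshold` of the
    previous epoch's mask, or None if the mask never stabilizes.

    Assumption: comparing only consecutive epochs; a longer window of
    past masks is another common choice.
    """
    previous = None
    for epoch, weights in enumerate(weights_per_epoch):
        mask = magnitude_mask(weights, prune_ratio)
        if previous is not None and mask_distance(previous, mask) < threshold:
            return epoch
        previous = mask
    return None


# Toy weight snapshots for three epochs (illustrative values only).
snapshots = [
    [0.1, -0.5, 0.3, 0.05],
    [0.4, -0.6, 0.2, 0.01],
    [0.5, -0.7, 0.1, 0.02],
]
early_bird_epoch = find_early_bird(snapshots, prune_ratio=0.5, threshold=0.1)
```

In this toy run the mask changes between the first two snapshots and is unchanged between the second and third, so the third snapshot (index 2) is flagged. In practice the pruned subnetwork found at that point would then be retrained, which corresponds to the selective retraining step the abstract mentions.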

Paper Structure (13 sections, 3 figures, 1 table)

This paper contains 13 sections, 3 figures, 1 table.

Introduction
Related Work
Methodology
Experiments
Experimental Setup
Results and Analysis
ViT
Swin-T
GPT-2
RoBERTa
Memory Usage
Discussion
Conclusion

Figures (3)

Figure 1: Comparison of Transformer training methods
Figure 2: Accuracy plots for each model.
Figure 3: Heatmaps and mask distance plots at p = 0.1 (to the left) and 0.3 (to the right) for all models
