When BERT Plays the Lottery, All Tickets Are Winning

Sai Prasanna, Anna Rogers, Anna Rumshisky

TL;DR

This work tests the lottery ticket hypothesis on fine-tuned BERT, using magnitude pruning and structured pruning on GLUE tasks. It identifies 'good' subnetworks that retain ~90% of full-model performance, and finds that many 'bad' or randomly pruned subnetworks can still be fine-tuned to strong performance, which highlights how broadly useful the pre-trained weights are. The 'good' subnetworks are not stable across random seeds and do not appear to encode linguistic knowledge in a straightforward way, suggesting that optimization dynamics, rather than isolated pieces of linguistic knowledge, explain BERT's success. The results suggest that pre-trained transformers benefit from a favorable loss landscape, with substantial interaction between self-attention heads and MLPs and only partial transfer of linguistic structure via individual components.

Abstract

Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
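
To make the magnitude-pruning setup in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the toy weight values, and the 50% sparsity level are all assumptions chosen for illustration. It shows how a mask might be chosen from fine-tuned weights, how a subnetwork that is only pruned differs from one whose surviving weights are reset to their pre-trained values before further fine-tuning, and how a same-sized subnetwork could be taken from the rest of the model.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of
    smallest-magnitude weights (hypothetical helper)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


# Toy values standing in for one weight vector (assumed, not from the paper).
pretrained = [0.5, -0.1, 0.8, 0.05, -0.9, 0.2]
finetuned = [0.6, -0.05, 0.7, 0.02, -1.0, 0.3]

# Choose which weights survive, based on fine-tuned magnitudes.
mask = magnitude_mask(finetuned, sparsity=0.5)

# Subnetwork that is only pruned: fine-tuned weights with the mask applied.
pruned_only = apply_mask(finetuned, mask)

# Starting point for retraining: survivors reset to pre-trained values.
reset_for_retraining = apply_mask(pretrained, mask)

# A same-sized subnetwork drawn from the rest of the model. At 50% sparsity
# the complement of the mask has the same size; this is a simplification.
complement_mask = [1 - m for m in mask]

print(mask)
print(reset_for_retraining)
```

In a real experiment the mask would be applied per weight matrix (or per head and MLP for structured pruning) and the reset subnetwork would then be fine-tuned again; the sketch only covers the masking and reset steps.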

Paper Structure

This paper contains 24 sections, 5 equations, 20 figures, 2 tables.

Figures (20)

  • Figure 1: The "good" subnetworks for QNLI: self-attention heads (top, 12 x 12 heatmaps) and MLPs (bottom, 1 x 12 heatmaps), pruned together. Earlier layers start at 0.
  • Figure 2: The "good" and "bad" subnetworks in BERT fine-tuning: performance on GLUE tasks. 'Pruned' subnetworks are only pruned, and 'retrained' subnetworks are restored to pretrained weights and fine-tuned. Subfigure titles indicate the task and percentage of surviving weights. STD values and error bars indicate standard deviation of surviving weights and performance respectively, across 5 fine-tuning runs. See the appendix on GLUE evaluation metrics for numerical results, and the section on "bad" subnetworks for GLUE baseline discussion.
  • Figure 3: Head importance scores distribution (this example shows CoLA, pruning iteration 1)
  • Figure 4: Attention pattern type distribution
  • Figure 5: Overlaps in BERT's "good" subnetworks between GLUE tasks: self-attention heads.
  • ...and 15 more figures