Investigating Transferability in Pretrained Language Models

Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman

TL;DR

This work interrogates how pretrained language models transfer to downstream tasks by introducing partial reinitialization to measure layer-wise parameter transferability in BERT. It reveals that transferability is highly sensitive to finetuning data size and task, and that the benefits of pretrained parameters arise from complex interactions across layers rather than from any single layer’s fixed representation. Probing results do not reliably predict transfer performance in data-rich settings, though they align more with transferability under data scarcity, highlighting distinct but related notions of transfer. The study also shows that the order of layers matters, and scrambling layers disrupts learned inter-layer dynamics, underscoring the need for methods beyond probing to understand and optimize transfer in pretrained models.

Abstract

How does language model pretraining help transfer learning? We consider a simple ablation technique for determining the impact of each pretrained layer on transfer task performance. This method, partial reinitialization, involves replacing different layers of a pretrained model with random weights, then finetuning the entire model on the transfer task and observing the change in performance. This technique reveals that in BERT, layers with high probing performance on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks. Furthermore, the benefit of using pretrained parameters for a layer varies dramatically with finetuning dataset size: parameters that provide tremendous performance improvement when data is plentiful may provide negligible benefits in data-scarce settings. These results reveal the complexity of the transfer learning process, highlighting the limitations of methods that operate on frozen models or single data samples.
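
The partial reinitialization step described above can be sketched in a few lines. This is a minimal illustration on a toy stack of weight lists, not the authors' implementation: the names `make_layers` and `partially_reinitialize`, the layer width, and the initialization scale `std=0.02` are all assumptions made for the example, and the subsequent end-to-end finetuning step is omitted.

```python
import copy
import random


def make_layers(num_layers, width, rng):
    """Toy stand-in for a pretrained stack: one flat weight list per layer."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_layers)]


def partially_reinitialize(layers, reinit_indices, rng, std=0.02):
    """Return a copy of `layers` whose chosen layers hold fresh random weights.

    The input is left untouched, so the same pretrained stack can be reused
    across trials. `std` is an illustrative choice, not a value from the paper.
    """
    new_layers = copy.deepcopy(layers)
    for i in reinit_indices:
        new_layers[i] = [rng.gauss(0.0, std) for _ in new_layers[i]]
    return new_layers


rng = random.Random(0)
pretrained = make_layers(num_layers=12, width=4, rng=rng)
snapshot = copy.deepcopy(pretrained)

# Replace the last four layers with random weights; the first eight stay pretrained.
model = partially_reinitialize(pretrained, reinit_indices=[8, 9, 10, 11], rng=rng)

# Indices of layers that still hold the pretrained weights.
kept = [i for i in range(len(model)) if model[i] == pretrained[i]]
print(kept)
```

In the procedure the abstract describes, the partially reinitialized model would then be finetuned in full on the transfer task, and its score compared with that of the fully pretrained model.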

Paper Structure

This paper contains 25 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: The three experiments we explore. Lighter shades indicate randomly reinitialized layers, while darker shades indicate layers with BERT parameters. For layer permutations, all layers hold BERT parameters; what changes between trials is their order. In all three experiments, the entire model is finetuned end-to-end on the GLUE task.
  • Figure 2: The benefit of using BERT parameters instead of random parameters at a particular layer varies dramatically depending on the size of the finetuning dataset. However, as finetuning dataset size decreases, the curves align more closely with probing performance at each layer. Solid lines show finetuning results after reinitializing all layers past layer $k$ in BERT-Base: $k = 12$ is the full BERT model, while $k = 0$ is a model with all layers reinitialized. Line darkness indicates subsampled dataset size. The dashed lines show probing performance at each layer. Error bars are 95% CIs.
  • Figure 3: Early layers provide the most QNLI gains, but middle ones yield an added boost for CoLA and SST-2. Finetuning results for 1) reinitializing a consecutive three-layer block ("block reinitialized") and 2) reinitializing all other layers ("block preserved"). Dashed horizontal lines show the finetuning performance of the full BERT model and the performance of a model with only embedding parameters preserved. Finetuning trials with 5k examples. Error bars are 95% CIs.
  • Figure 4: Changing the order of pretrained layers harms finetuning performance significantly. Dashed lines mark the performance of the original BERT model and the randomly-initialized model (surrounded by $\pm 2\sigma$ error bars). Circles denote finetuning performance for different layer permutations, while the solid line denotes the mean across runs (with 95% CIs). The curved shaded region is a kernel density plot, which illustrates the distribution of outcomes. Finetuning trials with 5k examples.
  • Figure 5: Finetuning results after reinitializing all layers past layer $k$ in BERT-Base: $k = 12$ is the full BERT model, while $k = 0$ is a model with all layers reinitialized. Scatterplot of 50 trials per layer shown for subsampled dataset size 500. Dotted line shows the mean.
  • ...and 3 more figures
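
The experiments in the captions above differ only in which layers keep their BERT parameters, or in the order of the layers. The sketch below enumerates those layer selections. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, layers are indexed from 0, and the embedding parameters are not modeled.

```python
import random

NUM_LAYERS = 12  # BERT-Base has 12 transformer layers


def progressive(k):
    """Layers to reinitialize when all layers past layer k are reset.

    k = 12 keeps the full model; k = 0 resets every layer.
    """
    return list(range(k, NUM_LAYERS))


def block_reinitialized(start, size=3):
    """Layers to reinitialize when one consecutive block is reset."""
    return list(range(start, start + size))


def block_preserved(start, size=3):
    """Layers to reinitialize when only one consecutive block is kept."""
    block = set(range(start, start + size))
    return [i for i in range(NUM_LAYERS) if i not in block]


def permuted_order(seed):
    """A shuffled layer order; every layer keeps its pretrained weights."""
    order = list(range(NUM_LAYERS))
    random.Random(seed).shuffle(order)
    return order


print(progressive(9))          # layers reset when only the first nine are kept
print(block_reinitialized(3))  # the three-layer block that is reset
print(block_preserved(3))      # every layer except that block
print(permuted_order(seed=0))  # one candidate layer ordering
```

Each selection would then be applied to the pretrained model before the end-to-end finetuning run whose score the figures report.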