Investigating Transferability in Pretrained Language Models
Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman
TL;DR
This work studies how pretrained language models transfer to downstream tasks, introducing partial reinitialization to measure the transferability of each layer’s parameters in BERT. Transferability is highly sensitive to both the finetuning task and the amount of finetuning data, and the benefit of pretrained parameters arises from interactions across layers rather than from any single layer’s fixed representation. Probing results do not reliably predict transfer performance when finetuning data is plentiful, but they align more closely with transferability when data is scarce, which suggests that probing and parameter transferability capture distinct but related notions of transfer. Layer order also matters: scrambling the layers disrupts learned inter-layer dynamics, underscoring the need for methods beyond probing to understand and optimize transfer in pretrained models.
Abstract
How does language model pretraining help transfer learning? We consider a simple ablation technique for determining the impact of each pretrained layer on transfer task performance. This method, partial reinitialization, involves replacing different layers of a pretrained model with random weights, then finetuning the entire model on the transfer task and observing the change in performance. This technique reveals that in BERT, layers with high probing performance on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks. Furthermore, the benefit of using pretrained parameters for a layer varies dramatically with finetuning dataset size: parameters that provide tremendous performance improvement when data is plentiful may provide negligible benefits in data-scarce settings. These results reveal the complexity of the transfer learning process, highlighting the limitations of methods that operate on frozen models or single data samples.
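The reinitialization step described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. The model is a toy stand-in (a list of layers, each a flat list of weights), and the function name `partially_reinitialize`, the Gaussian initializer, and the `std` value are all assumptions made for illustration. Finetuning is omitted.

```python
import copy
import random


def partially_reinitialize(pretrained_layers, layers_to_reset, std=0.02, seed=0):
    """Return a copy of the model whose chosen layers have fresh random weights.

    pretrained_layers: list of layers, each a flat list of floats (toy stand-in
        for real parameter tensors).
    layers_to_reset: indices of the layers whose pretrained weights are discarded.
    std: standard deviation of the Gaussian used for new weights (assumed value).
    """
    rng = random.Random(seed)
    model = copy.deepcopy(pretrained_layers)  # leave the pretrained weights untouched
    for idx in layers_to_reset:
        # Replace every weight in this layer with a random draw.
        model[idx] = [rng.gauss(0.0, std) for _ in model[idx]]
    return model


# Toy "pretrained" model: 4 layers with 3 weights each, all set to 1.0.
pretrained = [[1.0, 1.0, 1.0] for _ in range(4)]

# Reinitialize the top two layers; the bottom two keep their pretrained weights.
model = partially_reinitialize(pretrained, layers_to_reset=[2, 3])

# In the procedure the abstract describes, the entire model (kept and reset
# layers alike) would next be finetuned on the transfer task, and its accuracy
# compared against the fully pretrained baseline.
```

Repeating this for different choices of `layers_to_reset` and comparing finetuned performance is what gives a per-layer picture of how much the pretrained parameters contribute.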
