
An Empirical Study of Scaling Laws for Transfer

Matthew Barnett

TL;DR

This study examines a scaling law that incorporates a "transfer gap" term, which indicates how effective pre-training on one distribution is when optimizing for downstream performance on another distribution. The findings contribute to a principled way to measure transfer learning efficiency and to understand how data availability affects capabilities.

Abstract

We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a "transfer gap" term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost effective. Fitting the scaling law to experiments from diverse datasets reveals significant variations in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies and highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning efficiency and understand how data availability affects capabilities.

Paper Structure (27 sections, 13 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: This plot presents the trade-offs between expanding pre-training and collecting more fine-tuning data to achieve low loss on the synthetic fictional encyclopedia dataset. The isolines, or lines of equivalent loss values, delineate the points at which equal loss is achievable given different combinations of pre-training steps and fine-tuning data points. The plot demonstrates that at low pre-training values, significant benefits can be gained from both increasing fine-tuning data and expanding pre-training. Conversely, at high pre-training values, the marginal benefit of additional pre-training diminishes, making the collection of more fine-tuning data points increasingly valuable for reducing loss.
  • Figure 2: The plots illustrate how the optimal budget allocation between pre-training and fine-tuning, $p$ and $f$, evolves under different conditions. Plot (a) shows the relationship as the transfer gap from the pre-training distribution to the fine-tuning distribution $G$ increases, where initially the budget is largely allocated towards pre-training ($p$), but as $G$ rises, the allocation shifts significantly towards fine-tuning ($f$). Plot (b) illustrates the budget allocation for fine-tuning data ($f$) as the ratio of the cost per fine-tuning data point to the cost per pre-training step ($C_f / C_p$) increases. Initially, the allocation towards fine-tuning ($f$) is higher, but as the cost ratio ($C_f / C_p$) increases, the budget allocation for pre-training ($p$) also increases, reflecting the impact of cost ratio on budget optimization for a fixed transfer gap.
  • Figure 3: This plot shows a cross-section of the scaling law fitted to the data for the fictional encyclopedia dataset. It illustrates both clear transfer learning and that the power law form provides a good fit in pre-training data steps. This empirical observation confirms our intuition that the scaling law for transfer should reduce to a power law under various conditions. These conditions are detailed in the appendix (\ref{app:derive_scaling_law}).
  • Figure 4: Validation loss compared against the predicted values of the fitted scaling law in three dimensions, in logarithmic space, for the synthetic dataset. From visual inspection, the scaling law appears to be a close fit for the data, fitting the shape of the data points, and showing no obvious signs of overfitting.
  • Figure 5: Cross-entropy loss as a function of pre-training data steps and fine-tuning data points across all datasets in the study. The plots show clear transfer learning, with loss decreasing as pre-training increases. The math arXiv dataset shows slowly decreasing loss from fine-tuning relative to pre-training, reflected in the scaling law by a relatively small fine-tuning exponent. The housecat genome data displays the most erratic and unclear pattern, likely reflecting instability caused by the high dissimilarity of this dataset from the pre-training data.
  • ...and 3 more figures
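
The budget trade-off described in the Figure 2 caption can be explored with a small numerical sketch. The text above does not state the scaling law itself, so the functional form used here, L(p, f) = (A p^(-alpha) + G) f^(-beta) + E, is an assumption chosen to match the description of a power law with a transfer gap term G. All constants, the budget, and the helper names `loss` and `best_finetune_share` are hypothetical and are not fitted values from the study.

```python
def loss(p, f, A=1.0, alpha=0.3, G=0.1, beta=0.2, E=0.5):
    """Assumed scaling law for transfer (illustrative form, not from the text).

    p: number of pre-training steps, f: number of fine-tuning data points.
    G is the transfer gap: the part of the loss that pre-training cannot remove.
    """
    return (A * p ** -alpha + G) * f ** -beta + E


def best_finetune_share(budget, cost_p, cost_f, steps=999, **params):
    """Grid-search the share of the budget spent on fine-tuning data.

    Returns (share, loss) for the share in (0, 1) with the lowest loss,
    subject to cost_p * p + cost_f * f == budget.
    """
    best_share, best_loss = None, float("inf")
    for i in range(1, steps + 1):
        share = i / (steps + 1)                 # fraction spent on fine-tuning
        f = share * budget / cost_f             # fine-tuning data points bought
        p = (1.0 - share) * budget / cost_p     # pre-training steps bought
        value = loss(p, f, **params)
        if value < best_loss:
            best_share, best_loss = share, value
    return best_share, best_loss


if __name__ == "__main__":
    # Hypothetical budget and unit costs; only the trend across G matters.
    for gap in (0.0, 0.01, 0.1, 1.0, 10.0):
        share, value = best_finetune_share(1e6, 1.0, 1.0, G=gap)
        print(f"G={gap:<5} fine-tuning share={share:.3f} loss={value:.4f}")
```

Under this assumed form, a zero transfer gap gives the fine-tuning share beta / (alpha + beta) of the budget, and the share grows as G increases, which matches the qualitative shift towards fine-tuning described for plot (a).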