Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression

Allan Raventós; Mansheej Paul; Feng Chen; Surya Ganguli

TL;DR

The paper investigates how pretraining task diversity governs the emergence of in-context learning (ICL) for regression. Using a controlled linear-regression framework, it identifies a task-diversity threshold that separates two regimes: below the threshold, the transformer behaves like a Bayesian estimator whose prior is the pretraining task distribution; beyond it, the transformer shows non-Bayesian, ridge-like behavior that enables learning new tasks in-context. Key findings are that the threshold scales roughly linearly with task dimension, that regularization and model capacity modulate the threshold, and that beyond the threshold the transformer can asymptotically match Ridge performance on unseen tasks, indicating true ICL. These results imply that ICL is an emergent phenomenon that requires sufficiently diverse pretraining data and sufficient scale, with implications for understanding and improving ICL in language models.

Abstract

Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL's performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a $\textit{task diversity threshold}$ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the $\textit{non-diverse pretraining task distribution}$ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over $\textit{all tasks}$, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers $\textit{can}$ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on it deviating from the Bayes optimal estimator with the pretraining distribution as the prior. This study also explores the effect of regularization, model capacity and task structure and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL. Code is available at https://github.com/mansheej/icl-task-diversity.
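To make the two reference estimators named in the abstract concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes the standard forms: Ridge as the posterior mean under a Gaussian prior $\mathcal{N}(0, I)$ over all tasks with known noise variance, and the discrete estimator (dMMSE in the outline below) as the posterior mean under a uniform prior over a finite set of pretraining tasks. The function names and the values of D, K, M and noise_var are illustrative assumptions.

```python
import numpy as np

def ridge_estimate(X, y, noise_var):
    """Posterior mean of w under prior N(0, I) and Gaussian noise (assumed form)."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + noise_var * np.eye(D), X.T @ y)

def dmmse_estimate(X, y, tasks, noise_var):
    """Posterior mean of w under a uniform prior over the rows of `tasks` (assumed form)."""
    residuals = y[None, :] - tasks @ X.T              # shape (M, K)
    log_w = -0.5 * np.sum(residuals**2, axis=1) / noise_var
    log_w -= log_w.max()                              # stabilize the softmax
    weights = np.exp(log_w)
    weights /= weights.sum()
    return weights @ tasks                            # convex combination of pretraining tasks

# Illustrative setup (all values are assumptions chosen for the sketch).
rng = np.random.default_rng(0)
D, K, M, noise_var = 8, 16, 4, 0.25
tasks = rng.standard_normal((M, D))                   # finite pretraining task set
w_new = rng.standard_normal(D)                        # a task not in the pretraining set
X = rng.standard_normal((K, D))                       # in-context inputs
y = X @ w_new + np.sqrt(noise_var) * rng.standard_normal(K)

w_ridge = ridge_estimate(X, y, noise_var)
w_dmmse = dmmse_estimate(X, y, tasks, noise_var)
print("squared error, Ridge:", np.sum((w_ridge - w_new) ** 2))
print("squared error, dMMSE:", np.sum((w_dmmse - w_new) ** 2))
```

Under these assumptions, the discrete estimator always returns a weighted average of the pretraining tasks, so it cannot recover a new task that lies outside their convex hull, whereas Ridge is not restricted in this way.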

Paper Structure (19 sections, 15 equations, 14 figures)

This paper contains 19 sections, 15 equations, 14 figures.

Introduction
Problem setup
Experiments and results
Task diversity threshold for the emergence of in-context learning
The PT exhibits better scaling of the task diversity threshold with dimension than dMMSE.
Effect of regularization and model capacity on the task diversity threshold.
Related work
Discussion
Bayesian estimators
Bayesian MMSE estimator
dMMSE estimator
Ridge estimator
Experimental details
Support Figure 2
Dependence on number of sequences per task
...and 4 more sections

Figures (14)

Figure 1: Schematic for ICL of linear regression. (Left) A task corresponds to a latent regression vector, $\mathbf{w}$ (purple line). $(\mathbf{x}_1, y_1, ..., \mathbf{x}_{K}, y_{K})$ (black circles) is a sequence of in-context examples for this task. (Right) The PT, $f_\theta$, takes this as input and generates $K$ outputs. The $k$th output, $f_\theta(S_k)$, is the prediction for the target of $\mathbf{x}_k$ and depends only on the context $S_k = (\mathbf{x}_1, y_1, ..., \mathbf{x}_{k-1}, y_{k-1}, \mathbf{x}_k)$.
Figure 2: ICL emerges in PTs beyond a threshold pretraining task diversity. We show all results on both tasks seen during pretraining (top row) and on new tasks (bottom row). The left column compares the normalized loss of transformers pretrained with increasing task diversity to that of dMMSE and Ridge. When the pretraining task diversity is small, the PT's performance matches that of dMMSE; it performs very well on tasks seen during pretraining but poorly on new tasks. As the pretraining task diversity increases, both dMMSE and PT approach Ridge. However, the PT approaches Ridge much faster, significantly outperforming dMMSE on new tasks (bottom left). In the middle and right columns, we compare the PT's predictions to those of dMMSE and Ridge respectively, using the $\Delta$ metric (\ref{eq:delta}). We also increase the number of sequences per task at each level of task diversity by increasing the batch size while keeping total training steps fixed. This reveals a task diversity threshold between $2^{14}$ and $2^{15}$ pretraining tasks at which there is a phase transition in the behavior of the model. Below the threshold, increasing the dataset size leads to PTs with predictions more aligned with dMMSE on $\mathcal{T}_\text{Pretrain}$ (top middle). However, beyond this threshold (indicated by gray shading), increasing the dataset size leads to PTs more aligned with Ridge on all tasks (right).
Figure 3: Increasing the number of pretraining steps reveals the same task diversity threshold for the emergence of ICL. Columns 1 and 2 in this figure are similar to the middle column in Figure 2, and columns 3 and 4 correspond to the right column in Figure 2, except that here we increase the number of sequences per task by increasing the number of training steps while keeping batch size = 256. Both methods of increasing dataset size (increasing batch size in Figure 2 and increasing training steps in this figure) reveal a transition in the behavior of the PT: beyond the task diversity threshold, ICL on new tasks emerges.
Figure 4: Learning dynamics of small PTs show a transition at the task diversity threshold. We plot $\Delta^{\mathcal{T}_\text{Pretrain}}_{\text{PT,Ridge}}$ vs training steps for small PTs. For the same $M$, learning curves for short (500K steps, left) or long (2M steps, center) training durations are similar, and for $M > M^* \approx 2^{11.5}$ learning curves are similar to those of a model trained with infinite task diversity. Right: For $M \leq 2^{10}$, $t^*$ (the training step at which $\Delta^{\mathcal{T}_\text{Pretrain}}_{\text{PT,Ridge}}$ is minimized) is well modeled by a scaling law $t^* \propto M^\alpha$. A linear fit of $\log t^*$ vs $\log M$ (dashed red line) gives $\alpha \approx 0.47$. But for $M > 2^{10}$, $\Delta^{\mathcal{T}_\text{Pretrain}}_{\text{PT,Ridge}}$ decreases throughout training, so $t^*$ = 2M, which is larger than predicted by the scaling law. This sudden break in the scaling law suggests a fundamental difference in the learning dynamics of models on either side of the threshold.
Figure 5: Transformers pretrained with high, but not low, task diversity can learn new tasks in-context. We compare the normalized loss of the PT to that of dMMSE and Ridge as we interpolate between tasks in the pretraining dataset. Left: At $M = 2^5$ tasks, well below the task diversity threshold, the PT performance matches that of the dMMSE estimator along interpolating paths, but underperforms Ridge on new tasks at the center. Middle: At $M = 2^{10}$ tasks, the PT outperforms dMMSE on new tasks at the center of the interpolation path, but is not yet as good as Ridge on new tasks. Right: At $M = 2^{15}$ tasks, just above the task diversity threshold, the PT performs as well as Ridge even on new tasks at the center. This demonstrates that, when pretrained on data with a finite but large number of unique tasks, the PT, unlike the Bayes optimal estimator for $\mathcal{T}_\text{Pretrain}$, can learn new tasks in-context.
...and 9 more figures
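
The Figure 4 caption describes fitting a scaling law $t^* \propto M^\alpha$ by a linear fit of $\log t^*$ against $\log M$. A minimal sketch of such a fit is given below. The data are synthetic, generated from an exact power law purely for illustration; they are not the paper's measurements, and the helper name fit_power_law and the prefactor 1000 are assumptions.

```python
import numpy as np

def fit_power_law(M, t_star):
    """Fit t_star = c * M**alpha by least squares in log-log space."""
    alpha, log_c = np.polyfit(np.log(M), np.log(t_star), deg=1)
    return alpha, np.exp(log_c)

# Synthetic, noise-free data (NOT measurements from the paper).
M = 2.0 ** np.arange(1, 11)          # task diversities 2^1 ... 2^10
t_star = 1000.0 * M ** 0.47          # hypothetical prefactor, exponent chosen by hand

alpha, c = fit_power_law(M, t_star)
print(f"fitted exponent alpha = {alpha:.3f}, prefactor c = {c:.1f}")
```

On real measurements, points that fall well above the fitted line at large $M$ would correspond to the break in the scaling law that the caption describes.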
