When can in-context learning generalize out of task distribution?

Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab

TL;DR

This work investigates when in-context learning (ICL) in transformers can generalize to tasks outside the pretraining distribution, focusing on linear-regression tasks and introducing task-space diversity via spherical-cap distributions. Using a phase-diagram framework, the authors show a specialization-to-generalization transition around a critical task-diversity threshold (φ ≈ 120°), and demonstrate that models can outperform Bayes-optimal OOD estimators and, in some cases, resemble OLS solutions with sufficient context. The study extends the phenomenon to nonlinear regression and classification, finding robust transitions across dimensions and depths, and reveals an interplay between task-space diversity and the number of pretraining tasks. These results suggest that task diversity, not just the number of tasks, governs when ICL becomes a general-purpose tool, with implications for understanding generalization in language models and designing more robust pretraining regimes.

Abstract

In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize out of distribution. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.

Paper Structure

This paper contains 34 sections, 1 theorem, 20 equations, 19 figures.

Key Result

Proposition 1.1

$Z = R(2U-1)$, with $U\sim \operatorname{Beta}(\frac{d}{2},\frac{d}{2})$

Figures (19)

  • Figure 1: Testing ICL generalization via task similarity. A: The transformer takes as input a sequence of pairs $\{x_i,y_i\}_{i=1}^n$ and is trained to predict $y_k$ from a context $C_k = \{x_1, y_1, \ldots, x_k\}$. The elements $x_i$ and $y_i$ are related linearly by a task $w$ via $y_i = w^Tx_i + \epsilon_i$. B: The training tasks $w_\mathrm{train}$ are drawn from a hyperspherical cap with half-angle $\phi$ (with $\phi=180^\circ$ corresponding to the entire hypersphere). The test tasks $w_\mathrm{test}$ are drawn from a hyperspherical band of width $\Delta\delta$ starting an angle $\delta$ away from the "pole" of the sphere.
  • Figure 2: Task distribution diversity induces a transition from specialized to general-purpose ICL. A: Test error in $\Delta\delta = 5^\circ$ bands (see Fig. 1) for transformers pretrained to do in-context learning of linear functions with pretraining task distributions $p_\phi(w)$. For distributions with $\phi \lesssim 120^\circ$, the transformer learns a specialized solution that performs well on unseen tasks drawn from $p_\phi(w)$, but fails for tasks outside this distribution. However, for pretraining distributions with $\phi \gtrsim 120^\circ$, the transformer learns a solution that performs well for all test angles $\delta$. Here, the label noise is $\sigma^2 = 0$. B: With $\sigma^2 = 0.25$, we still observe a transition from a specialized to a generic solution, but the transition point has moved to $\phi \approx 135^\circ$. The vertical axis measures the excess test error above the noise floor set by $\sigma^2$.
  • Figure 3: Pretrained transformers outperform Bayes-optimal solutions in out-of-task-distribution generalization. For $\delta=175^\circ$ (shaded grey region), we plot the excess test loss for models with varying pretraining distributions. The dashed line shows the test error for the optimal in-task-distribution Bayesian solution (see Section \ref{sec:optimal-bayes}).
  • Figure 4: Specialized ICL outperforms OLS for small context length. We evaluate the models in-task-distribution for varying context lengths, and plot the performance of the transformer (solid) and ordinary least squares (dashed) for the same data. For low context length, the specialized solution learned by models with $\phi \lesssim 90^\circ$ outperforms OLS. For $\phi = 15^\circ$, the specialized solution is worse than OLS for large context length.
  • Figure 5: Transformers trained to do ICL on the sphere generalize beyond it. A: The test error for tasks drawn uniformly from subsets of a hypersphere of radius $R$, when a model is pretrained on tasks taken only from subsets of the unit hypersphere. When $\phi \gtrsim 45^\circ$, the model generalizes to tasks with $R<1$ (shaded), despite being pretrained with $R=1$. B: Increasing task diversity drives generalization beyond the sphere: With sufficient task diversity ($\phi \gtrsim 45^\circ$), transformers generalize not only to OOD tasks on the sphere (Fig. 2), but also to OOD tasks within it.
  • ...and 14 more figures
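
The setup in the Figure 1 caption (tasks $w$ drawn from a hyperspherical cap of half-angle $\phi$, data generated as $y_i = w^Tx_i + \epsilon_i$) and the OLS baseline mentioned in the Figure 4 caption can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function names, the choice of the first coordinate axis as the "pole", the standard-normal inputs $x_i$, and the rejection sampler for the polar angle are all assumptions made for illustration.

```python
import numpy as np


def sample_cap_tasks(n_tasks, dim, phi_deg, rng):
    """Draw unit-norm task vectors uniformly from a hyperspherical cap.

    Assumption: the cap is centred on the first coordinate axis (the "pole")
    and has half-angle phi_deg; phi_deg = 180 gives the whole sphere.
    Requires dim >= 2.
    """
    phi = np.deg2rad(phi_deg)
    # The polar angle of a uniform point on the sphere has density
    # proportional to sin(theta)**(dim - 2). Restrict it to [0, phi]
    # with simple rejection sampling against a uniform proposal.
    peak = np.sin(min(phi, np.pi / 2)) ** (dim - 2)
    thetas = []
    while len(thetas) < n_tasks:
        theta = rng.uniform(0.0, phi)
        if rng.uniform(0.0, peak) <= np.sin(theta) ** (dim - 2):
            thetas.append(theta)
    thetas = np.array(thetas)
    # Uniform direction in the subspace orthogonal to the pole.
    u = rng.standard_normal((n_tasks, dim - 1))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    w = np.empty((n_tasks, dim))
    w[:, 0] = np.cos(thetas)
    w[:, 1:] = np.sin(thetas)[:, None] * u
    return w


def make_context(w, n_points, noise_var, rng):
    """Generate pairs (x_i, y_i) with y_i = w^T x_i + eps_i.

    Assumption: inputs are standard normal and the noise is Gaussian
    with variance noise_var.
    """
    x = rng.standard_normal((n_points, w.shape[0]))
    y = x @ w + np.sqrt(noise_var) * rng.standard_normal(n_points)
    return x, y


def ols_predict(x_ctx, y_ctx, x_query):
    """Ordinary least squares baseline fitted on the context only."""
    # lstsq returns the minimum-norm solution when the context is
    # shorter than the input dimension.
    w_hat, *_ = np.linalg.lstsq(x_ctx, y_ctx, rcond=None)
    return x_query @ w_hat
```

For example, `sample_cap_tasks(200, 8, 60.0, np.random.default_rng(0))` returns 200 unit vectors in 8 dimensions, all within $60^\circ$ of the pole. Test tasks in a band starting at angle $\delta$ could be drawn the same way by proposing the polar angle on $[\delta, \delta + \Delta\delta]$ instead of $[0, \phi]$.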

Theorems & Definitions (2)

  • Proposition 1.1
  • proof