How Does the Pretraining Distribution Shape In-Context Learning? Task Selection, Generalization, and Robustness

Waïss Azizian, Ali Hasan

TL;DR

This work studies how the pretraining distribution affects in-context learning (ICL) by decomposing ICL into task selection and generalization. It develops a unified theoretical framework that extends Bayesian posterior consistency to heavy-tailed priors and dependent data, deriving task-retrieval concentration rates and generalization bounds that reveal a trade-off: heavier-tailed priors speed up task identification but worsen generalization, with bounds depending on tail moment $q$ and dependency terms. The authors validate the theory with numerical experiments on linear regression, Ornstein–Uhlenbeck, and Volterra processes, showing that distribution shifts favor heavier tails for robustness but require more pretraining tasks for reliable generalization. Practically, these insights guide the design of pretraining distributions to achieve robust, ICL-capable transformers on numerically challenging tasks, especially where memory and long-range dependencies are present.

Abstract

The emergence of in-context learning (ICL) in large language models (LLMs) remains poorly understood despite its consistent effectiveness, enabling models to adapt to new tasks from only a handful of examples. To clarify and improve these capabilities, we characterize how the statistical properties of the pretraining distribution (e.g., tail behavior, coverage) shape ICL on numerical tasks. We develop a theoretical framework that unifies task selection and generalization, extending and sharpening earlier results, and show how distributional properties govern sample efficiency, task retrieval, and robustness. To this end, we generalize Bayesian posterior consistency and concentration results to heavy-tailed priors and dependent sequences, better reflecting the structure of LLM pretraining data. We then empirically study how ICL performance varies with the pretraining distribution on challenging tasks such as stochastic differential equations and stochastic processes with memory. Together, these findings suggest that controlling key statistical properties of the pretraining distribution is essential for building ICL-capable and reliable LLMs.
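The task-selection view described above can be illustrated with a toy linear-regression example. The following is a minimal sketch, not the authors' code: it assumes a finite pool of pretraining tasks with i.i.d. Student-$t$ coordinates, Gaussian observation noise, and a uniform prior over the pool. The names `sample_tasks`, `task_posterior`, and `noise_std` are hypothetical.

```python
import numpy as np

def sample_tasks(n_tasks, dim, nu, rng):
    """Draw task vectors with i.i.d. Student-t coordinates (nu = degrees of freedom)."""
    return rng.standard_t(df=nu, size=(n_tasks, dim))

def task_posterior(tasks, X, y, noise_std=0.5):
    """Posterior over a finite candidate pool, assuming a uniform prior over the
    pool and Gaussian noise (both are illustrative assumptions)."""
    residuals = y[None, :] - tasks @ X.T                 # shape (n_tasks, T)
    log_lik = -0.5 * np.sum(residuals**2, axis=1) / noise_std**2
    weights = np.exp(log_lik - np.max(log_lik))          # subtract max for stability
    return weights / weights.sum()

rng = np.random.default_rng(0)
tasks = sample_tasks(n_tasks=64, dim=4, nu=3.0, rng=rng)
true_idx = 7
for T in (1, 4, 16, 64):                                 # growing context length
    X = rng.standard_normal((T, 4))
    y = X @ tasks[true_idx] + 0.5 * rng.standard_normal(T)
    post = task_posterior(tasks, X, y)
    print(f"T={T:3d}  posterior mass on true task: {post[true_idx]:.3f}")
```

Varying `nu` changes how spread out the candidate tasks are, which is one simple way to explore how the tail of the prior interacts with the speed of task identification in this toy setting.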

Paper Structure

This paper contains 53 sections, 23 theorems, 183 equations, 13 figures, 1 table.

Key Result

Theorem 1

Let $\rho \in (0,1)$. Under asm:data_generation_informal, with $\pi(\theta^{\ast}) > 0$ and $x_{1:T} \sim p_{T}(\cdot \mid \theta^{\ast})$, the p… where the terms in $\mathcal{O}(\cdot)$ …

Figures (13)

  • Figure 1: Influence of the degrees-of-freedom parameter of a Student-$t$ pretraining distribution on the ICL error for different task shifts, with and without importance weighting. Weighted samples are indicated by the $-\star$ marker.
  • Figure 2: Generalization for linear regression with a Student-$t$ prior of varying degrees of freedom $\nu$, as a function of $n$.
  • Figure 3: Influence of the degrees-of-freedom parameter of a Student-$t$ pretraining distribution on the ICL error for different task shifts, with and without importance weighting, when predicting the next step of an OU process with a context length of 32. Weighted samples are indicated by the $-\star$ marker.
  • Figure 4: Influence of the shape of a generalized normal pretraining distribution on the ICL error for different task shifts, with and without importance weighting, when predicting the next step of an OU process.
  • Figure 5: Generalization of a transformer trained to predict the next step of a Volterra process, as a function of the number of tasks $n$, with a context length of 32.
  • ...and 8 more figures
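
For readers unfamiliar with the next-step prediction setting in the OU figures, here is a minimal sketch of how such a context could be generated. It is an illustration under stated assumptions, not the paper's experimental code: it assumes the standard OU dynamics $dX_t = -\theta X_t\,dt + \sigma\,dW_t$ with $\theta > 0$, uses the exact Gaussian transition, and draws $\theta$ as a shifted absolute Student-$t$ value purely to keep it positive. The function names, `dt`, and `sigma` are hypothetical; only the context length of 32 comes from the captions.

```python
import numpy as np

def simulate_ou(theta, sigma, x0, n_steps, dt, rng):
    """Simulate dX = -theta * X dt + sigma dW using the exact Gaussian transition.
    Requires theta > 0."""
    decay = np.exp(-theta * dt)
    step_std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * theta))
    path = np.empty(n_steps + 1)
    path[0] = x0
    for t in range(n_steps):
        path[t + 1] = decay * path[t] + step_std * rng.standard_normal()
    return path

def oracle_next_step(path, theta, dt):
    """Conditional mean of the next state when theta is known."""
    return np.exp(-theta * dt) * path[-1]

rng = np.random.default_rng(0)
nu = 3.0                                      # degrees of freedom of the task prior
theta = 0.1 + abs(rng.standard_t(df=nu))      # illustrative positive task parameter
context = simulate_ou(theta, sigma=1.0, x0=0.0, n_steps=31, dt=0.1, rng=rng)
print(len(context), oracle_next_step(context, theta, dt=0.1))
```

The oracle predictor gives a reference point: a model that has identified $\theta$ from the context cannot have lower expected squared error than this conditional mean.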

Theorems & Definitions (47)

  • Example 3.1: Classification
  • Example 3.2: Linear Regression
  • Example 3.3: Next-sample prediction for stochastic processes
  • Theorem 1: Task selection
  • Theorem 2
  • Definition 1: Kullback-Leibler divergence
  • Lemma D.1: Donsker-Varadhan lemma, Gibbs variational principle
  • Lemma D.2
  • proof
  • Proposition D.1: Template task selection bound
  • ...and 37 more