In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

Tomoya Wakayama, Taiji Suzuki

TL;DR

The paper develops a Bayes-centric, finite-sample theory for in-context learning (ICL) under a mixture of task types, decomposing the ICL risk into a Bayes Gap and a model-independent Posterior Variance. It provides non-asymptotic bounds that couple the number of pretraining prompts ($N$) and the context length ($p$) for uniform-attention Transformers, and it proves rapid posterior concentration on the true task in mixtures, which explains fast adaptation at inference. An out-of-distribution stability analysis shows that the Bayes Gap is sensitive to Wasserstein shifts between pretraining and test prompts, while the Posterior Variance remains tied to the target task. The results offer concrete design guidance (architecture scaling, permutation-invariant structures, and distribution alignment) and justify viewing ICL as implicit Bayesian inference with efficient meta-algorithm selection. Overall, the work unifies pretraining and in-context learning under a Bayesian framework and clarifies how ICL can rapidly approximate Bayes-optimal predictions in diverse task mixtures.

Abstract

This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.

Paper Structure

This paper contains 36 sections, 17 theorems, 164 equations, 3 figures.

Key Result

Theorem 1

Consider the prompt-generating process from Definition 1 and assume that the boundedness assumption on $f$ (ass:bounded_f) holds. For a measurable, bounded map $M$, the ICL risk decomposes into two components, the Bayes Gap and the Posterior Variance.
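A decomposition of this kind can be checked in closed form on a toy conjugate model. The sketch below is an illustration under stated assumptions, not the paper's setting: the "task" is a scalar $\theta \sim N(0, \tau^2)$, the context holds $k$ noisy observations of it, and the predictor $M$ is the plain sample mean. The function name `risk_terms` and all parameter values are hypothetical.

```python
def risk_terms(tau2, sigma2, k):
    """Closed-form risk terms for estimating theta ~ N(0, tau2) from k
    observations y_i = theta + N(0, sigma2) noise, with the sample mean
    playing the role of the (non-Bayes) predictor M."""
    s = sigma2 / k                  # variance of the sample mean given theta
    shrink = tau2 / (tau2 + s)      # Bayes predictor is shrink * sample_mean
    total_risk = s                  # E[(sample_mean - theta)^2]
    # Bayes Gap: E[(M - Bayes predictor)^2]
    #          = (1 - shrink)^2 * E[sample_mean^2], with E[sample_mean^2] = tau2 + s
    bayes_gap = (1.0 - shrink) ** 2 * (tau2 + s)
    # Posterior Variance: Var(theta | data); it does not depend on M
    posterior_variance = 1.0 / (1.0 / tau2 + k / sigma2)
    return total_risk, bayes_gap, posterior_variance


for k in (1, 5, 25):
    total, gap, pv = risk_terms(tau2=1.0, sigma2=4.0, k=k)
    # The two columns agree: total risk = Bayes Gap + Posterior Variance
    print(k, total, gap + pv)
```

In this toy case the Posterior Variance is the part of the risk that no predictor can remove, while the Bayes Gap measures how far the sample mean is from the posterior mean.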

Figures (3)

  • Figure 1: Bayesian view of in-context learning (ICL). The upper path: the process of computing the optimal prediction is $(D^{k},\bm{x}_{k+1})\mapsto \mathbb{E}_{f\sim{\mathcal{P}}(f\mid D^k)}[f(\bm{x}_{k+1})]$ given ${\mathcal{P}}(f)$. The lower path: since ${\mathcal{P}}(f)$ is unknown, the model $M_{\hat{\theta}}$, pretrained on data from ${\mathcal{P}}(f)$, aims to emulate this process via $(D^{k},\bm{x}_{k+1})\mapsto M_{\hat{\theta}}(D^k,\bm{x}_{k+1})$.
  • Figure 2: Behavior of the Bayes Gap (left: $N$-sweep, right: $p$-sweep). The left panel fixes $p\in\{5,10,15\}$ and varies the number of pretraining prompts $N$; the right panel fixes $N\in\{500,1000,2000\}$ and varies the pretraining context length $p$. In both cases, the Bayes Gap generally decreases as $N$ or $p$ increases, demonstrating that longer contexts and more pretraining prompts improve the approximation to the Bayes predictor.
  • Figure 3: In-context error under task mixtures (left: predictive MSE; right: parameter-estimation MSE). As the context length $k$ increases, both predictive error for the next label $y_{k+1}$ from $P^k$ (left) and parameter-estimation error (right) decrease monotonically. The Transformer closely tracks the mixture Bayes predictor and, with sufficient context, approaches the oracle Bayes curve that knows the true task family. This demonstrates the rapid concentration of the task-index posterior under growing context and the corresponding shrinkage of the irreducible term.
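The concentration of the task-index posterior described in the Figure 3 caption can be illustrated with a much simpler mixture than the one in the paper. The sketch below assumes each task type is a single known Gaussian mean with shared noise level, which is a deliberate simplification; `task_posterior`, `mixture_bayes_prediction`, and the numeric values are hypothetical names and choices made for this example.

```python
import math
import random


def task_posterior(ys, means, sigma, prior):
    """Posterior over the task index given observations ys, where task t
    generates y ~ N(means[t], sigma^2) and prior[t] is its mixture weight."""
    log_w = []
    for m, pi in zip(means, prior):
        log_lik = sum(-0.5 * ((y - m) / sigma) ** 2 for y in ys)
        log_w.append(math.log(pi) + log_lik)
    mx = max(log_w)                          # subtract max for numerical stability
    w = [math.exp(v - mx) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def mixture_bayes_prediction(ys, means, sigma, prior):
    """Bayes prediction of the next label under the mixture: the
    posterior-weighted average of the per-task means."""
    post = task_posterior(ys, means, sigma, prior)
    return sum(p * m for p, m in zip(post, means))


if __name__ == "__main__":
    rng = random.Random(0)                   # fixed seed for repeatability
    means, sigma, prior = [0.0, 1.0], 0.5, [0.5, 0.5]
    ys = [rng.gauss(means[1], sigma) for _ in range(10)]  # true task is index 1
    for k in (0, 1, 2, 5, 10):
        post = task_posterior(ys[:k], means, sigma, prior)
        print(k, post, mixture_bayes_prediction(ys[:k], means, sigma, prior))
```

With an empty context the posterior equals the mixture prior. Each additional example adds a log-likelihood term, so on average the weight on the wrong task shrinks geometrically in the context length $k$.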

Theorems & Definitions (37)

  • Definition 1: Prompt-Generating Process
  • Definition 2: Uniform-attention Transformer Architecture
  • Remark 1: Meta-train/test protocol
  • Theorem 1: Risk decomposition for in-context learning
  • Theorem 2: Bayes Gap upper bound
  • Theorem 3: Gap between Posterior Variance and minimax risk of the true task type
  • Theorem 4: Wasserstein stability of the Bayes Gap
  • Lemma 1: Conditional exchangeability
  • Theorem 5: Risk-reducing symmetrization
  • Proof: Proof of Theorem 5
  • ...and 27 more
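The entries on conditional exchangeability and risk-reducing symmetrization, together with the TL;DR's mention of permutation-invariant structures, point to a generic idea that can be sketched independently of the paper's proofs. Averaging an order-dependent predictor over all orderings of the context yields a permutation-invariant predictor, and by convexity of the squared loss its squared error is no larger than the average squared error across orderings. The predictor below is a hypothetical recency-weighted mean chosen only for illustration; it is not the paper's construction.

```python
from itertools import permutations


def order_dependent_predictor(context):
    """Hypothetical predictor: recency-weighted mean with weights 1, 2, ..., k,
    so the output changes when the context is reordered."""
    weights = range(1, len(context) + 1)
    return sum(w * y for w, y in zip(weights, context)) / sum(weights)


def symmetrize(predictor, context):
    """Average the predictor over every ordering of the context, which makes
    the result invariant to permutations of the context."""
    perms = list(permutations(context))
    return sum(predictor(list(p)) for p in perms) / len(perms)


def average_squared_error(predictor, context, target):
    """Mean squared error of the predictor across all orderings of the context."""
    perms = list(permutations(context))
    return sum((predictor(list(p)) - target) ** 2 for p in perms) / len(perms)


context, target = [1.0, 2.0, 6.0], 2.5
sym = symmetrize(order_dependent_predictor, context)
print(sym)                                   # permutation-invariant prediction
print((sym - target) ** 2)                   # error of the symmetrized predictor
print(average_squared_error(order_dependent_predictor, context, target))
```

Enumerating all permutations costs $k!$ evaluations, so this is only practical for very short contexts; an architecture that is permutation-invariant by construction avoids the averaging altogether.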