In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
Tomoya Wakayama, Taiji Suzuki
TL;DR
The paper develops a Bayes-centric, finite-sample theory for in-context learning (ICL) under a mixture of task types, decomposing the ICL risk into a Bayes Gap and a model-independent Posterior Variance. For uniform-attention Transformers, it gives non-asymptotic bounds on the Bayes Gap that couple the number of pretraining prompts (N) with the context length (p). It also proves that, in a task mixture, the posterior concentrates rapidly on the true task, which explains fast adaptation at inference. An out-of-distribution stability analysis shows that the Bayes Gap is sensitive to Wasserstein shifts between the pretraining and test prompt distributions, while the Posterior Variance remains tied to the target task. The results offer concrete design guidance (architecture scaling, permutation-invariant structures, and distribution alignment) and justify viewing ICL as implicit Bayesian inference with efficient meta-algorithm selection. Overall, the work places pretraining and in-context learning in a single Bayesian framework and clarifies how ICL can rapidly approximate Bayes-optimal predictions across diverse task mixtures.
Abstract
This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.
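The claim that mixture uncertainty vanishes quickly with a few in-context examples can be illustrated with a toy simulation. The sketch below is not the paper's setting: it assumes a hypothetical mixture of two task types that differ only in the mean of a Gaussian observation, a uniform prior over task types, and a known noise level. The function names and all parameter values are illustrative choices. It estimates the average posterior mass placed on the wrong task type after k in-context examples.

```python
import math
import random


def posterior_over_tasks(ys, means, sigma=1.0, prior=None):
    """Exact posterior over candidate task types given observations ys.

    Assumption (not from the paper): task j generates y ~ N(means[j], sigma^2).
    """
    m = len(means)
    prior = prior or [1.0 / m] * m
    # Log prior plus Gaussian log-likelihood; shared constants cancel on normalizing.
    log_w = [
        math.log(prior[j]) - sum((y - means[j]) ** 2 for y in ys) / (2 * sigma**2)
        for j in range(m)
    ]
    top = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def wrong_task_mass(k, means=(0.0, 2.0), true_task=0, sigma=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of the posterior mass on the wrong task after k examples."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        ys = [rng.gauss(means[true_task], sigma) for _ in range(k)]
        post = posterior_over_tasks(ys, means, sigma)
        acc += 1.0 - post[true_task]
    return acc / trials


if __name__ == "__main__":
    for k in (0, 1, 2, 4, 8, 16):
        print(f"k={k:2d}  wrong-task posterior mass ~ {wrong_task_mass(k):.5f}")
```

In this toy model the log-odds in favor of the true task grow linearly in k on average, so the mass on the wrong task shrinks roughly geometrically. What remains after the task is identified is the uncertainty within that task, which is the role the abstract assigns to the Posterior Variance.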
