Is In-Context Learning in Large Language Models Bayesian? A Martingale Perspective

Fabian Falck, Ziyu Wang, Chris Holmes

TL;DR

The paper investigates whether in-context learning (ICL) in large language models is Bayesian by formalizing the martingale property, a necessary condition for Bayesian learning on exchangeable data that also enables a principled decomposition of uncertainty. It derives diagnostics for the martingale property and for the scaling of epistemic uncertainty, and evaluates several state-of-the-art LLMs on synthetic Bernoulli, Gaussian, and natural language tasks against Bayesian reference models. On short sample paths some models appear roughly consistent with the martingale property, but longer paths reveal systematic, model-dependent deviations, and epistemic uncertainty does not scale as Bayesian learning predicts, falsifying the hypothesis that ICL is Bayesian. The results underscore the importance of martingale-consistent conditioning for trustworthy uncertainty quantification and motivate the development of models and prompts that respect exchangeability, especially in safety-critical applications.

Abstract

In-context learning (ICL) has emerged as a particularly remarkable characteristic of Large Language Models (LLMs): given a pretrained LLM and an observed dataset, LLMs can make predictions for new data points from the same distribution without fine-tuning. Numerous works have postulated ICL as approximately Bayesian inference, rendering this a natural hypothesis. In this work, we analyse this hypothesis from a new angle through the martingale property, a fundamental requirement of a Bayesian learning system for exchangeable data. We show that the martingale property is a necessary condition for unambiguous predictions in such scenarios, and enables a principled, decomposed notion of uncertainty vital in trustworthy, safety-critical systems. We derive actionable checks with corresponding theory and test statistics which must hold if the martingale property is satisfied. We also examine if uncertainty in LLMs decreases as expected in Bayesian learning when more data is observed. In three experiments, we provide evidence for violations of the martingale property, and deviations from a Bayesian scaling behaviour of uncertainty, falsifying the hypothesis that ICL is Bayesian.
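
To make the two properties named in the abstract concrete, the following minimal sketch checks them on a reference Bayesian model rather than on an LLM. It assumes a Beta-Bernoulli model with a Beta(a, b) prior, which is an illustrative choice and not necessarily the baseline used in the paper. The function names (`predictive_one`, `expected_next_predictive`, `posterior_variance`) are hypothetical. The first check shows that averaging the next-step predictive over an imputed observation returns the current predictive (a one-step martingale identity); the second shows posterior variance shrinking as more data is observed.

```python
from fractions import Fraction


def predictive_one(ones: int, n: int, a: int = 1, b: int = 1) -> Fraction:
    """P(Z_{n+1} = 1 | Z_{1:n}) under a Beta(a, b) prior, given `ones` successes in n draws."""
    return Fraction(a + ones, a + b + n)


def expected_next_predictive(ones: int, n: int, a: int = 1, b: int = 1) -> Fraction:
    """E[ P(Z_{n+2} = 1 | Z_{1:n+1}) | Z_{1:n} ], averaging over the imputed Z_{n+1}."""
    p = predictive_one(ones, n, a, b)
    after_one = predictive_one(ones + 1, n + 1, a, b)   # predictive if Z_{n+1} = 1
    after_zero = predictive_one(ones, n + 1, a, b)      # predictive if Z_{n+1} = 0
    return p * after_one + (1 - p) * after_zero


def posterior_variance(ones: int, n: int, a: int = 1, b: int = 1) -> Fraction:
    """Variance of the Beta posterior over the Bernoulli parameter (epistemic uncertainty)."""
    alpha, beta = a + ones, b + (n - ones)
    total = alpha + beta
    return Fraction(alpha * beta, total * total * (total + 1))


if __name__ == "__main__":
    # Martingale identity: the expected next predictive equals the current predictive.
    for ones, n in [(0, 0), (3, 10), (42, 100)]:
        assert expected_next_predictive(ones, n) == predictive_one(ones, n)

    # Epistemic uncertainty shrinks as n grows (here with a fixed success rate of 0.5).
    for n in [10, 100, 1000]:
        print(n, float(posterior_variance(n // 2, n)))
```

Exact rational arithmetic via `fractions.Fraction` is used so that the identity holds with equality rather than up to floating-point error. An LLM-based check would replace `predictive_one` with the model's in-context predictive probability, which is where deviations from the identity could appear.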

Paper Structure (43 sections, 4 theorems, 17 equations, 13 figures)

Key Result

Proposition 1

A sequence $\{Z_{n+1:n+m}\} \sim p_M(\cdot\vert Z_{1:n})$ satisfies the martingale property if and only if the following holds: for all $n',k\in\mathbb{N}$ and integrable functions $g,h$:

Figures (13)

  • Figure 1: In-context learning in Large Language Models is not Bayesian. [Left] The martingale property, a necessary condition of Bayesian learning systems, is satisfied for short sample paths. [Centre] This allows us to approximate the martingale posterior (see the section "The martingale property enables a principled notion of uncertainty") which, however, indicates deviation from a reference Bayesian model. [Right] For longer sample paths, we observe a drift which violates the martingale property, together rendering the ICL system non-Bayesian.
  • Figure 2: The martingale property, a fundamental requirement of a Bayesian learning system, requires invariance with respect to missing samples from a population.
  • Figure 3: Checking the martingale property on Bernoulli experiments. Each data point represents a test statistic (y-axis) evaluated for an LLM, as derived in the CID-check section. Subplot and x-axis correspond to choices of Bernoulli probabilities and LLMs. Shade indicates the $95\%$ confidence interval from a reference Bayesian model.
  • Figure 4: Checking the martingale property on Gaussian experiments. We present runs with $\theta=-1,n=100,m=50$ from different LLMs (x-axis) with test functions $g(z)=z$ and $g(z)=z^2$. See Fig. 3 for further details.
  • Figure 5: Checking the martingale property on the natural language experiment. We present both checks with test statistics computed separately for each value of $X_i$ (x-axis). See Fig. 3 for further details.
  • ...and 8 more figures

Theorems & Definitions (11)

  • Definition 1
  • Example 1
  • Proposition 1
  • Corollary 1
  • Example 2
  • proof
  • Proposition 1
  • proof
  • Corollary 1
  • proof
  • ...and 1 more