LLMs are Bayesian, In Expectation, Not in Realization

Leon Chlon, Zein Khamis, Maggie Chlon, Mahdi El Zein, MarcAntonio M. Awada

TL;DR

Robustness checks extend beyond Bernoulli to categorical sequences, synthetic in-context learning tasks, and evidence-grounded QA with permuted exchangeable evidence chunks, where expectation--realization gaps remain nonzero.

Abstract

Exchangeability-based martingale diagnostics have been used to question Bayesian explanations of transformer in-context learning. We show that the violations these diagnostics detect are compatible with Bayesian/MDL behavior once we account for a basic architectural fact: positional encodings break exchangeability. Accordingly, the relevant baseline is performance in expectation over orderings of an exchangeable multiset, not performance under every fixed ordering. In a Bernoulli microscope (under explicit regularity assumptions), we bound the permutation-induced dispersion detected by martingale diagnostics (Theorem~3.4) while proving near-optimal expected MDL/compression over permutations (Theorem~3.6). Empirically, black-box next-token log-probabilities from an Azure OpenAI deployment exhibit nonzero expectation--realization gaps that decay with context length (mean falling from 0.74 at $n = 10$ to 0.26 at $n = 50$, reported with 95\% confidence intervals), and permutation averaging reduces order-induced standard deviation with a $k^{-1/2}$ trend (Figure~2). Controlled from-scratch training ablations varying only the positional encoding show within-prefix order variance collapsing to $\approx 10^{-16}$ with no positional encoding, but remaining at $10^{-8}$--$10^{-6}$ under standard positional encoding schemes (Table~2). Robustness checks extend beyond Bernoulli to categorical sequences, synthetic in-context learning tasks, and evidence-grounded QA with permuted exchangeable evidence chunks.

Paper Structure

This paper contains 54 sections, 4 theorems, 33 equations, 3 figures, 3 tables.

Key Result

Lemma 3.3

Let $(X_i)_{i\ge 1}$ be exchangeable, and for any ordering $\pi$ define the online filtration $\mathcal{F}_t^\pi := \sigma(X_{\pi(1)},\ldots,X_{\pi(t)})$. In the Bernoulli model with latent parameter $p$ (and any prior), the Bayesian posterior predictive mean is (a) order-invariant---it is a function only of the multiset $\{X_{\pi(1)},\ldots,X_{\pi(t)}\}$ (equivalently of the count $S_t=\sum_{i=1
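
The order-invariance in part (a) can be checked numerically in a small case. The sketch below is illustrative only and is not taken from the paper: it assumes a conjugate Beta(a, b) prior (the lemma allows any prior), and the function names are hypothetical. It updates the posterior one observation at a time under every ordering of a fixed multiset and collects the resulting predictive means.

```python
from fractions import Fraction
from itertools import permutations


def posterior_predictive_mean(xs, a=1, b=1):
    """P(next = 1 | xs) under an assumed Beta(a, b) prior (illustrative choice).

    The update runs sequentially over xs to mimic online prediction
    along one particular ordering.
    """
    alpha, beta = Fraction(a), Fraction(b)
    for x in xs:
        alpha += x        # one more observed success if x == 1
        beta += 1 - x     # one more observed failure if x == 0
    return alpha / (alpha + beta)


def predictive_means_over_orderings(xs, a=1, b=1):
    """Set of predictive means obtained across all orderings of xs."""
    return {posterior_predictive_mean(p, a, b) for p in permutations(xs)}


observations = (1, 0, 1, 1, 0)
means = predictive_means_over_orderings(observations)
print(means)  # a single element: the mean depends only on the count of ones
```

Exact rational arithmetic is used so that the set collapses to one element without floating-point noise. Under this assumed prior the common value is (a + S) / (a + b + t), where S is the number of ones among the t observations.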

Figures (3)

  • Figure 1: Martingale gaps and periodic components. Proxy gaps $\hat{\Delta}_n$ decrease with $n$ and exhibit periodic structure; a debiasing step removes the period-64 component while preserving the slower trend (Sec. \ref{sec:empirical}).
  • Figure 2: Permutation averaging reduces order-induced standard deviation on ICL tasks. We generate procedurally defined few-shot classification tasks (threshold and linear2d), hold the demonstration set fixed, and vary only the demo order (200 tasks; 20 random orders per task). For each task we estimate $\mathrm{Std}(\bar{p}_A)$ under $k$ independent random orderings (resampling permutations with replacement); the resulting decay is consistent with $k^{-1/2}$.
  • Figure 3: Periodic components in the gap curve. The raw gap exhibits a strong period-64 component and harmonics; a harmonic+residual debiasing step removes these periodic components while leaving the slower gap-vs-$n$ trend.
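
The $k^{-1/2}$ behavior described in the Figure 2 caption is what one expects from averaging independent draws. The following is a minimal simulation sketch, not the paper's experimental code: the "model" is a hypothetical recency-weighted mean standing in for an order-sensitive predictor, and all function names, the demonstration labels, and the seed are assumptions made for illustration.

```python
import math
import random
import statistics


def order_sensitive_prediction(xs):
    """Hypothetical stand-in for a position-aware model: later items weigh more."""
    weights = [i + 1 for i in range(len(xs))]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def permutation_averaged_prediction(xs, k, rng):
    """Average the prediction over k random orderings drawn with replacement."""
    total = 0.0
    for _ in range(k):
        shuffled = list(xs)
        rng.shuffle(shuffled)
        total += order_sensitive_prediction(shuffled)
    return total / k


def std_of_average(xs, k, repeats, rng):
    """Monte Carlo estimate of the standard deviation of the k-order average."""
    estimates = [permutation_averaged_prediction(xs, k, rng) for _ in range(repeats)]
    return statistics.pstdev(estimates)


rng = random.Random(0)
demos = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # fixed multiset; only the order varies
stds = {k: std_of_average(demos, k, repeats=2000, rng=rng) for k in (1, 4, 16)}
for k, s in stds.items():
    # If the decay follows k^{-1/2}, the last column stays roughly constant.
    print(k, round(s, 4), round(s * math.sqrt(k), 4))
```

Because the orderings are sampled independently, the standard deviation of the average should shrink by roughly a factor of two each time k is quadrupled, up to Monte Carlo error in the estimate itself.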

Theorems & Definitions (11)

  • Definition 3.1: Position-Aware Transformer
  • Definition 3.2: Position-Agnostic Transformer
  • Lemma 3.3: Exchangeable Bayes Baseline Across Orderings
  • Theorem 3.4: Quantified Order-Induced Gap (Upper Bound)
  • Definition 3.5: Empirical MDL
  • Theorem 3.6: MDL Optimality of Position-Aware Transformers
  • Remark 3.7: Implicit pseudo-count behavior (informal)
  • Proposition 3.8: Permutation Averaging: Monte Carlo Variance Reduction
  • proof
  • proof
  • ...and 1 more