Universal priors: solving empirical Bayes via Bayesian inference and pretraining

Nick Cannella, Anzo Teh, Yanjun Han, Yury Polyanskiy

TL;DR

This analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction.

Abstract

We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(\frac{1}{n})$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains the phenomenon of length generalization, in which the test sequence length exceeds the training length, as the model performs Bayesian inference using a generalized posterior.

Paper Structure (45 sections, 23 theorems, 107 equations, 5 figures, 1 algorithm)

Key Result

Theorem 1.1

Let $M\to\infty$, and assume that the training procedure finds the global minimizer of eq:ERM. Then, for a large enough hyperparameter $c_0>0$, the pretrained estimator $\widehat{\theta}^n$ in alg:universal_prior satisfies a regret bound of order $\widetilde{O}(\frac{1}{n})$ uniformly over all test distributions, with a leading constant $C=C(A,c_0)$ that depends only on $A$ and $c_0$.

Figures (5)

  • Figure 1: Regrets of different estimators with different test sequence lengths and test priors. For all transformers, the training sequence length is fixed at $n=512$ (indicated by the vertical dotted green line).
  • Figure 2: The regret of the hierarchical Bayes estimator, as well as its mean squared distance to the trained transformer, under simple training PoPs $\Pi_m = \frac{1}{m}\sum_{i=1}^m G_i^{\otimes n}$.
  • Figure 3: Plots of the mean squared distance between transformer output and the hierarchical Bayes estimator using various $\alpha$-posteriors, with different training lengths $n$ and test lengths ${n_{\mathsf{test}}}$. This distance is indeed minimized at $\alpha \simeq \frac{n}{{n_{\mathsf{test}}}}$.
  • Figure 4: Mean squared distance between transformer output and the hierarchical Bayes estimator using various $\alpha$-posteriors, trained on $m = 5$ priors.
  • Figure 5: Mean squared distance between transformer output and the hierarchical Bayes estimator using various $\alpha$-posteriors, trained on $m = 10$ priors.
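
The hierarchical Bayes estimator and the $\alpha$-posteriors named in Figures 2 to 5 can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions: each training prior $G_i$ is discrete with strictly positive atoms, the prior over the $m$ training priors is uniform, and the $\alpha$-posterior is taken to be the usual tempered posterior, with the likelihood of each $G_i$ raised to the power $\alpha$. The function names (`log_marginals`, `posterior_mean`, `hierarchical_bayes`) are hypothetical.

```python
import numpy as np

def log_marginals(x, atoms, weights):
    """Log of sum_k w_k * theta_k^x * exp(-theta_k) for each observation x.

    The -log(x!) term of the Poisson pmf is omitted: it is the same for
    every candidate prior, so it cancels in all ratios used below.
    Assumes strictly positive atoms (illustrative restriction).
    """
    log_terms = (np.log(weights)[None, :]
                 + x[:, None] * np.log(atoms)[None, :]
                 - atoms[None, :])                      # shape (n, K)
    top = log_terms.max(axis=1, keepdims=True)          # log-sum-exp for stability
    return (top + np.log(np.exp(log_terms - top).sum(axis=1, keepdims=True)))[:, 0]

def posterior_mean(x, atoms, weights):
    """Bayes estimator E[theta | X = x] under one discrete prior.

    For the Poisson model this equals g(x + 1) / g(x), where g is the
    unnormalized marginal computed by log_marginals.
    """
    return np.exp(log_marginals(x + 1, atoms, weights)
                  - log_marginals(x, atoms, weights))

def hierarchical_bayes(x, priors, alpha=1.0):
    """Posterior-mean estimate of each theta_i under a uniform mixture of priors.

    priors: list of (atoms, weights) pairs, one per training prior G_i.
    alpha:  tempering exponent (assumed form of the alpha-posterior);
            alpha = 1 gives the ordinary posterior over the G_i.
    Returns (estimates, posterior weights over the priors).
    """
    x = np.asarray(x, dtype=float)
    log_lik = np.array([log_marginals(x, a, w).sum() for a, w in priors])
    log_post = alpha * log_lik
    log_post -= log_post.max()                          # avoid overflow
    post = np.exp(log_post)
    post /= post.sum()
    means = np.stack([posterior_mean(x, a, w) for a, w in priors])  # (m, n)
    return post @ means, post

# Usage: two candidate priors, point masses at theta = 1 and theta = 5.
priors = [(np.array([1.0]), np.array([1.0])),
          (np.array([5.0]), np.array([1.0]))]
estimates, post = hierarchical_bayes([5, 6, 4, 5], priors, alpha=1.0)
```

With $\alpha = 1$ the weights over the candidate priors concentrate on the prior that best explains the data, which is the posterior contraction mechanism the abstract refers to. A smaller $\alpha$ flattens these weights, and $\alpha = 0$ ignores the data and returns the uniform mixture.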

Theorems & Definitions (26)

  • Theorem 1.1
  • Lemma 1.2
  • Theorem 1.3
  • Lemma 1.4
  • Theorem 1.5
  • Lemma 2.1
  • Lemma 2.2
  • Lemma 2.3
  • Lemma 2.4
  • Proof of thm:general, assuming lemma:posterior_contraction
  • ...and 16 more