Exchangeable Sequence Models Quantify Uncertainty Over Latent Concepts
Naimeng Ye, Hongseok Namkoong
TL;DR
This work reframes uncertainty quantification in autoregressive sequence models through De Finetti's predictive view of exchangeability, showing that pre-trained sequence models can perform Bayesian inference over latent environments by forward-generating future data. By treating one-step autoregressive probabilities as posterior predictives, the authors connect perplexity training to empirical Bayes and prove that forward sampling yields explicit posterior draws for the latent environment, enabling both length generalization and statistical inference. They develop the Exchangeable Transformer and investigate inductive biases—data augmentation, CID regularization, and causal masking—to promote permutation invariance and robust uncertainty quantification, supported by a Bayesian linear regression case study and length-generalization analysis. The results provide a principled framework for uncertainty quantification in in-context learning, with potential practical impact on long-horizon predictions and decision-making under uncertainty in real-world tasks. Key findings include that the limiting perplexity $H(\widehat{p})$ governs long-horizon performance, the excess risk decays as $O(\frac{\log T}{T})$ under exchangeability, and forward generation effectively implements Bayesian bootstrap-like inference for latent environments.
Abstract
Intelligent agents must be able to articulate their own uncertainty. In this work, we show that pre-trained sequence models are naturally capable of probabilistic reasoning over exchangeable data points -- forming informed beliefs and sharpening them as they gather more information. A sequence model learns the relationship between observations, which differs from typical Bayesian models that quantify uncertainty over latent parameters through priors and likelihoods (e.g., topic models). Despite the apparent difference, we illustrate how exchangeable sequence modeling provides a valid Bayesian model by going back to De Finetti's classical predictive view of probabilistic reasoning: uncertainty comes from data that has not been observed yet, rather than latent parameters. From this perspective, pre-training autoregressive models is equivalent to formulating informed beliefs based on prior observations ("empirical Bayes"), and forward generation is equivalent to simulating instantiations of an environment ("posterior inference"). In particular, exchangeable sequence models can explicitly perform statistical inference; epistemic uncertainty over latent environments is captured by variation in predicted future observations. Formally, we show the sequence prediction loss controls the quality of uncertainty quantification, and propose several approaches for encoding exchangeability in sequence model architectures: data augmentation, regularization, and causal masking.
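The claim that forward generation amounts to posterior inference can be illustrated with a toy example. The sketch below is not the authors' Exchangeable Transformer: it assumes a hand-written Beta-Bernoulli one-step predictive (a Polya urn) in place of a trained sequence model, and the names `one_step_predictive` and `forward_sample_latent` are hypothetical. Each forward rollout repeatedly samples the next observation from the one-step predictive and appends it to the history; the long-run mean of the generated observations then behaves like one draw of the latent success probability from its posterior.

```python
import numpy as np

def one_step_predictive(ones, total, a=1.0, b=1.0):
    """P(next = 1 | history) for an assumed Beta(a, b)-Bernoulli model."""
    return (a + ones) / (a + b + total)

def forward_sample_latent(observed, horizon, rng, a=1.0, b=1.0):
    """Roll the predictive forward `horizon` steps, feeding each sample back in.

    The mean of the generated future approximates one posterior draw
    of the latent success probability.
    """
    ones, total = int(np.sum(observed)), len(observed)
    future_ones = 0
    for _ in range(horizon):
        y = int(rng.random() < one_step_predictive(ones, total, a, b))
        ones += y          # the sampled value joins the history
        total += 1
        future_ones += y
    return future_ones / horizon

rng = np.random.default_rng(0)
observed = np.array([1, 1, 0, 1, 0, 1, 1, 1])   # 6 ones out of 8 (made-up data)

# Many independent rollouts give many approximate posterior draws.
draws = np.array([forward_sample_latent(observed, 1000, rng) for _ in range(400)])

# Exact posterior under a uniform prior is Beta(7, 3), whose mean is 0.7.
print("forward-sampled mean:", draws.mean())
print("forward-sampled std: ", draws.std())
```

The spread of `draws` across rollouts is the epistemic uncertainty about the latent parameter: it should roughly match the standard deviation of the exact Beta(7, 3) posterior (about 0.138) and would shrink as more observations are conditioned on.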
