Exchangeable Sequence Models Quantify Uncertainty Over Latent Concepts

Naimeng Ye, Hongseok Namkoong

TL;DR

This work reframes uncertainty quantification in autoregressive sequence models through De Finetti's predictive view of exchangeability, showing that pre-trained sequence models can perform Bayesian inference over latent environments by forward-generating future data. By treating one-step autoregressive probabilities as posterior predictives, the authors connect perplexity training to empirical Bayes and prove that forward sampling yields explicit posterior draws for the latent environment, enabling both length generalization and statistical inference. They develop the Exchangeable Transformer and investigate inductive biases—data augmentation, CID regularization, and causal masking—to promote permutation invariance and robust uncertainty quantification, supported by a Bayesian linear regression case study and length-generalization analysis. The results provide a principled framework for uncertainty quantification in in-context learning, with potential practical impact on long-horizon predictions and decision-making under uncertainty in real-world tasks. Key findings include that the limiting perplexity $H(\widehat{p})$ governs long-horizon performance, the excess risk decays as $O(\frac{\log T}{T})$ under exchangeability, and forward generation effectively implements Bayesian bootstrap-like inference for latent environments.

Abstract

Intelligent agents must be able to articulate their own uncertainty. In this work, we show that pre-trained sequence models are naturally capable of probabilistic reasoning over exchangeable data points -- forming informed beliefs and sharpening them as they gather more information. A sequence model learns the relationship between observations, which differs from typical Bayesian models that quantify uncertainty over latent parameters through priors and likelihoods (e.g., topic models). Despite the apparent difference, we illustrate how exchangeable sequence modeling provides a valid Bayesian model by going back to De Finetti's classical predictive view of probabilistic reasoning: uncertainty comes from data that has not been observed yet, rather than from latent parameters. From this perspective, pre-training autoregressive models is equivalent to formulating informed beliefs based on prior observations ("empirical Bayes"), and forward generation is equivalent to simulating instantiations of an environment ("posterior inference"). In particular, exchangeable sequence models can explicitly perform statistical inference; epistemic uncertainty over latent environments is captured by variation in predicted future observations. Formally, we show that the sequence prediction loss controls the quality of uncertainty quantification, and we propose several approaches for encoding exchangeability in sequence model architectures: data augmentation, regularization, and causal masking.
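To make the notion of an exchangeable sequence model concrete, the sketch below checks permutation invariance numerically. It is only an illustration under stated assumptions: a Beta-Bernoulli posterior predictive stands in for a trained autoregressive model, and the names `predictive_prob_one` and `sequence_log_prob` are hypothetical helpers, not part of the paper's code.

```python
import itertools
import math


def predictive_prob_one(ones, n, a=1.0, b=1.0):
    """One-step posterior predictive P(Y_{n+1} = 1 | y_{1:n}) for a
    Beta(a, b)-Bernoulli model (assumed stand-in for a sequence model)."""
    return (a + ones) / (a + b + n)


def sequence_log_prob(ys, a=1.0, b=1.0):
    """Joint log-probability of a binary sequence via the autoregressive
    chain rule: sum of one-step log predictive probabilities."""
    log_p, ones = 0.0, 0
    for n, y in enumerate(ys):
        p1 = predictive_prob_one(ones, n, a, b)
        log_p += math.log(p1 if y == 1 else 1.0 - p1)
        ones += y
    return log_p


ys = [1, 0, 1, 1, 0]
base = sequence_log_prob(ys)

# Exchangeability: every reordering of the same observations should
# receive the same joint probability.
max_gap = max(
    abs(sequence_log_prob(list(perm)) - base)
    for perm in itertools.permutations(ys)
)
print(f"log p(y_1:T) = {base:.6f}, max gap over permutations = {max_gap:.2e}")
```

A learned model is generally not guaranteed to pass this check, which is why the abstract lists data augmentation, regularization, and causal masking as ways to encourage it.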

Paper Structure (39 sections, 9 theorems, 96 equations, 9 figures, 1 table, 1 algorithm)


Key Result

Theorem 1

If a sequence $Y_{1:\infty}$ is infinitely exchangeable (see the exchangeability definition), then there exist a latent parameter $\theta$ and a measure $\pi(\cdot)$ over it, such that
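
The statement above ends before its display equation. For orientation only, the standard form of De Finetti's representation is given here, in notation that is an assumption and may differ from the paper's:

$$
\mathbb{P}(Y_{1:T} = y_{1:T}) \;=\; \int \prod_{t=1}^{T} \mathbb{P}(Y_t = y_t \mid \theta)\, \pi(d\theta) \qquad \text{for every } T \ge 1 .
$$

Read this way, the observations are i.i.d. conditional on $\theta$, and $\pi$ plays the role of a prior over the latent environment.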

Figures (9)

  • Figure 1: De Finetti's predictive view (DeFinetti33) treats uncertainty in the latent environment (the mental state of the patient) as coming from future data (questions and answers). Building on this insight, we show that the sequence prediction loss (perplexity) over exchangeable documents measures the quality of uncertainty quantification over latent environments. Thus, standard pre-training methods are in fact directly optimizing this quality through auto-differentiation and GPU parallelization.
  • Figure 1: Autoregressive bootstraps. $F^b_T$ is the empirical distribution of $(y_{1:s},\widehat{Y}_{s+1:T}^b)$.
  • Figure 2: (a) Given an observed sequence ("prompt"), autoregressive generation provides inferential capabilities by computing a statistic over the generated trajectory. The left panel plots trajectories of forward-generated outcomes; the right panel plots the histogram of the empirical mean of each trajectory. Under permutation invariance (exchangeability), this histogram is a valid approximation of the posterior distribution over the population mean. (b) Autoregressive models provide approximate posterior draws via forward sampling. We plot the KL divergence between this approximate posterior of a latent parameter and the posterior produced by the oracle. Our experiments show that enforcing exchangeability via causal masking (see the Exchangeable Transformer figure) provides large gains in inferential capabilities with 41x fewer parameters.
  • Figure 3: Length generalization.
  • Figure 4: Statistical inference.
  • ...and 4 more figures
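
The forward-sampling procedure described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming a Beta-Bernoulli posterior predictive in place of a trained sequence model; the function names, the prompt, and the horizon and draw counts are all hypothetical choices and do not come from the paper.

```python
import random
import statistics


def forward_sample_mean(prompt, horizon, rng, a=1.0, b=1.0):
    """Autoregressively generate `horizon` future outcomes after the prompt
    and return the empirical mean of the completed sequence."""
    ones, n = sum(prompt), len(prompt)
    for _ in range(horizon):
        # One-step posterior predictive of the assumed Beta(a, b)-Bernoulli model.
        p1 = (a + ones) / (a + b + n)
        y = 1 if rng.random() < p1 else 0
        ones += y
        n += 1
    return ones / n


def autoregressive_bootstrap(prompt, horizon=1000, num_draws=400, seed=0):
    """Repeat forward generation; each trajectory's mean is treated as one
    approximate posterior draw of the population mean."""
    rng = random.Random(seed)
    return [forward_sample_mean(prompt, horizon, rng) for _ in range(num_draws)]


prompt = [1, 1, 0, 1, 0, 1, 1, 1]  # 6 ones and 2 zeros observed so far
draws = autoregressive_bootstrap(prompt)

# For this toy model the exact posterior is Beta(1 + 6, 1 + 2) = Beta(7, 3),
# so the draws can be compared against its mean and standard deviation.
exact_mean = 7 / 10
exact_std = (7 * 3 / (10**2 * 11)) ** 0.5
print(f"sampled mean {statistics.mean(draws):.3f} vs exact {exact_mean:.3f}")
print(f"sampled std  {statistics.stdev(draws):.3f} vs exact {exact_std:.3f}")
```

A histogram of `draws` corresponds to the right panel of Figure 2(a). With a learned model, the quality of this approximation would depend on how closely the model respects exchangeability.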

Theorems & Definitions (11)

  • Definition 1
  • Theorem 1: De Finetti's theorem
  • Definition 2
  • Proposition 2: Martingale property
  • Theorem 3
  • Theorem 4
  • Lemma 1: Reparameterization
  • Proposition 5
  • Theorem 6: BertiPrRi04
  • Lemma 2: SLLN for martingale differences
  • ...and 1 more