Architectural and Inferential Inductive Biases For Exchangeable Sequence Modeling

Daksh Mittal, Ang Li, Tzu-Ching Yen, Daniel Guetta, Hongseok Namkoong

TL;DR

This work addresses how to model exchangeable sequences with autoregressive transformers for decision-making under uncertainty. It argues that multi-step autoregressive inference, aligned with De Finetti's predictive view, is necessary to separate epistemic from aleatoric uncertainty, whereas one-step inference conflates them and harms downstream tasks. The paper shows that masking-based architectures, which are only conditionally permutation-invariant (CPI), do not guarantee full exchangeability, and it introduces the conditionally identically distributed (c.i.d.) property as essential. It finds that CPI alone is insufficient and that CPI architectures often underperform standard causal masking, which is also computationally more efficient. Empirically, multi-step inference improves uncertainty quantification, bandit regret, and active learning sample efficiency, highlighting a need for new architectural inductive biases that truly enforce exchangeability while enabling efficient multi-step inference.

Abstract

Autoregressive models have emerged as a powerful framework for modeling exchangeable sequences - i.i.d. observations when conditioned on some latent factor - enabling direct modeling of uncertainty from missing data (rather than a latent). Motivated by the critical role posterior inference plays as a subroutine in decision-making (e.g., active learning, bandits), we study the inferential and architectural inductive biases that are most effective for exchangeable sequence modeling. For the inference stage, we highlight a fundamental limitation of the prevalent single-step generation approach: inability to distinguish between epistemic and aleatoric uncertainty. Instead, a long line of works in Bayesian statistics advocates for multi-step autoregressive generation; we demonstrate this "correct approach" enables superior uncertainty quantification that translates into better performance on downstream decision-making tasks. This naturally leads to the next question: which architectures are best suited for multi-step inference? We identify a subtle yet important gap between recently proposed Transformer architectures for exchangeable sequences (Muller et al., 2022; Nguyen & Grover, 2022; Ye & Namkoong, 2024), and prove that they in fact cannot guarantee exchangeability despite introducing significant computational overhead. We illustrate our findings using controlled synthetic settings, demonstrating how custom architectures can significantly underperform standard causal masks, underscoring the need for new architectural innovations.
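To make the notion of an exchangeable autoregressive model concrete, here is a minimal sketch that is not taken from the paper. It assumes a Beta-Bernoulli model with a uniform prior, whose one-step predictive rule is known in closed form, and checks that the joint probability built from the autoregressive chain rule is the same for every ordering of a sequence. The function names `next_prob_one` and `joint_prob` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def next_prob_one(history, a=1, b=1):
    """One-step predictive P(y_next = 1 | history) for a Beta(a, b)-Bernoulli model (assumed model)."""
    return Fraction(a + sum(history), a + b + len(history))


def joint_prob(seq, a=1, b=1):
    """Joint probability of seq via the autoregressive chain rule."""
    prob = Fraction(1)
    for t, y in enumerate(seq):
        p1 = next_prob_one(seq[:t], a, b)
        prob *= p1 if y == 1 else 1 - p1
    return prob


seq = (1, 0, 0, 1, 1)
# Collect the joint probability of every reordering of the same observations.
probs = {joint_prob(p) for p in permutations(seq)}
# Exchangeability means this set contains a single value.
print(probs)
```

A learned sequence model has no such closed form, so whether its chain-rule joint is permutation-invariant depends on the architecture and training, which is the question the abstract raises.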

Paper Structure

This paper contains 55 sections, 4 theorems, 28 equations, 57 figures, 3 algorithms.

Key Result

Theorem 1

If a sequence $Y_{1:\infty}$ is infinitely exchangeable, then there exists a latent parameter $\theta$ and a unique measure $\mu(\cdot)$ over $\theta$ such that
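
The statement is cut off here. For orientation only, the textbook form of De Finetti's representation is sketched below; the notation $P(y_i \mid \theta)$ for the per-observation likelihood is an assumption and may differ from the paper's own equation.

$$P(Y_1 = y_1, \dots, Y_n = y_n) = \int \prod_{i=1}^{n} P(y_i \mid \theta)\, \mu(d\theta) \quad \text{for all } n \ge 1.$$

In words, the observations are i.i.d. given $\theta$, and $\mu$ plays the role of a prior over $\theta$.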

Figures (57)

  • Figure 1: The equivalence between conventional Bayesian modeling and the predictive view of Bayesian inference is established by De Finetti's theorem (valid only under infinite exchangeability of $Y_{1:\infty}$). It shows that epistemic uncertainty in the latent parameter $\theta \sim \mu(\cdot|y_{1:t})$ is equivalent to predictive uncertainty over future observations $Y_{t+1:\infty} \sim P(\cdot|y_{1:t})$.
  • Figure 2: Meta-learned sequence models can be used for decision making
  • Figure 3: [Illustration of multi-step vs. one-step inference in decision making] Single-step inference treats Coins A and B as identical because both have the same predictive uncertainty in their rewards. Multi-step inference reveals a key difference: for Coin B, the uncertainty is epistemic and can be reduced by a single toss, whereas for Coin A, all the uncertainty is aleatoric (irreducible) and arises from the inherent randomness of a fair coin toss. Consequently, multi-step inference prioritizes tossing Coin B first to reduce epistemic uncertainty.
  • Figure 5: A representative attention mechanism and masking scheme widely used in prior literature to enforce exchangeability. However, it only ensures the conditionally permutation-invariant property.
  • Figure 6: Standard causal transformer architecture: attention mechanism and masking scheme
  • ...and 52 more figures
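
The coin illustration in Figure 3 can be sketched numerically. This is a minimal example under stated assumptions, not the paper's code: Coin A is taken to be a known fair coin, and Coin B is taken to be either always-heads or always-tails with equal prior probability. The dictionary `COINS` and all function names are hypothetical.

```python
from fractions import Fraction

# Assumed beliefs over each coin's heads probability: {bias: prior weight}.
COINS = {
    "A": {Fraction(1, 2): Fraction(1)},                        # known fair coin
    "B": {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)},  # two-tailed or two-headed
}


def predictive_heads(belief):
    """One-step predictive probability of heads."""
    return sum(p * w for p, w in belief.items())


def update(belief, heads):
    """Bayes update of the belief over the bias after one observed toss."""
    post = {p: w * (p if heads else 1 - p) for p, w in belief.items()}
    z = sum(post.values())
    return {p: w / z for p, w in post.items() if w > 0}


def epistemic_variance(belief):
    """Variance of the bias under the belief, i.e. of the long-run heads frequency."""
    m = predictive_heads(belief)
    return sum(w * (p - m) ** 2 for p, w in belief.items())


def expected_variance_after_one_toss(belief):
    """Expected remaining epistemic variance after observing one more toss."""
    ph = predictive_heads(belief)
    total = Fraction(0)
    for heads, prob in ((True, ph), (False, 1 - ph)):
        if prob > 0:
            total += prob * epistemic_variance(update(belief, heads))
    return total


for name, belief in COINS.items():
    print(name, predictive_heads(belief), epistemic_variance(belief),
          expected_variance_after_one_toss(belief))
```

Under these assumptions both coins have one-step predictive probability 1/2, so a one-step view cannot tell them apart. Only Coin B has nonzero variance in its long-run heads frequency, and that variance drops to zero after a single toss, which is why a multi-step view would toss Coin B first.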

Theorems & Definitions (9)

  • Theorem 1: De Finetti's theorem
  • Example 1
  • Theorem 2
  • Example 2
  • Theorem 3
  • Example 3
  • Theorem 4
  • Example 4
  • Example 5