Active Exploration via Autoregressive Generation of Missing Data

Tiffany Tianhui Cai, Hongseok Namkoong, Daniel Russo, Kelly W Zhang

TL;DR

This paper reframes active exploration as uncertainty over missing future outcomes and replaces posterior sampling of latent parameters with autoregressive generation of those outcomes. It introduces an offline/online pipeline: a sequence model is trained offline on past tasks, each with unstructured prior information Z, and is then used online to generate missing outcomes. This yields a form of Thompson sampling in which in-context learning updates beliefs without retraining. The authors prove that the quality of offline next-outcome prediction governs the accuracy of online posterior sampling, and they derive regret bounds that tie online performance to offline prediction loss, giving a principled reduction from online decision-making to offline sequence modeling. Experiments in synthetic and semi-synthetic news-recommendation settings show accurate uncertainty quantification and low regret when text-based prior information is used, illustrating how foundation-model information can support informed exploration. Overall, the work connects modern sequence modeling with adaptive decision-making, with potential applications to recommender systems and other interactive learning tasks.

Abstract

We pose uncertainty quantification and exploration in online decision-making as a problem of training and generation from an autoregressive sequence model, an area experiencing rapid innovation. Our approach rests on viewing uncertainty as arising from missing future outcomes that would be revealed through appropriate action choices, rather than from unobservable latent parameters of the environment. This reformulation aligns naturally with modern machine learning capabilities: we can i) train generative models through next-outcome prediction rather than fit explicit priors, ii) assess uncertainty through autoregressive generation rather than parameter sampling, and iii) adapt to new information through in-context learning rather than explicit posterior updating. To showcase these ideas, we formulate a challenging meta-bandit problem where effective performance requires leveraging unstructured prior information (like text features) while exploring judiciously to resolve key remaining uncertainties. We validate our approach through both theory and experiments. Our theory establishes a reduction, showing success at offline next-outcome prediction translates to reliable online uncertainty quantification and decision-making, even with strategically collected data. Semi-synthetic experiments show our insights bear out in a news-article recommendation task, where article text can be leveraged to minimize exploration.
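The idea of assessing uncertainty through autoregressive generation can be illustrated with a minimal sketch. It is not the authors' implementation: the learned sequence model is replaced by a hypothetical stand-in (`predictive_prob`, the Beta-Bernoulli posterior predictive), the prior information Z is omitted, and all function names and parameters are assumptions made for illustration. For each arm, the missing outcomes of its row are generated one at a time, each conditioned on the observed and previously generated outcomes, and the arm whose completed row has the highest mean is played.

```python
import random


def predictive_prob(successes, count, alpha=1.0, beta=1.0):
    """Hypothetical stand-in for a trained sequence model:
    P(next outcome = 1 | outcomes so far), here a Beta-Bernoulli predictive."""
    return (alpha + successes) / (alpha + beta + count)


def impute_mean(observed, row_length, rng):
    """Autoregressively generate the missing outcomes of one arm's row,
    then return the mean of the completed row."""
    successes, count = sum(observed), len(observed)
    while count < row_length:
        # Each generated outcome is fed back in as context for the next one.
        y = 1 if rng.random() < predictive_prob(successes, count) else 0
        successes += y
        count += 1
    return successes / count


def run_bandit(true_means, steps, row_length, seed=0):
    """Thompson-sampling-style loop: impute every row, act greedily on the imputation."""
    rng = random.Random(seed)
    histories = [[] for _ in true_means]
    pulls = [0] * len(true_means)
    for _ in range(steps):
        sampled = [impute_mean(h, row_length, rng) for h in histories]
        arm = max(range(len(true_means)), key=lambda a: sampled[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        histories[arm].append(reward)  # "in-context" update: no retraining step
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(run_bandit([0.2, 0.8], steps=300, row_length=400))
```

In this sketch, beliefs are updated only by appending new outcomes to the history that the predictive model conditions on, which mirrors the in-context adaptation described above. A learned sequence model that also conditions on article text would take the place of `predictive_prob`.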

Paper Structure

This paper contains 101 sections, 7 theorems, 65 equations, 18 figures, and 4 algorithms.

Key Result

Proposition 1

If, after employing some policy $\pi$ for $t-1$ time periods, Algorithm alg:posterior_sample is applied to the history $\mathcal{H}_{t-1}$ with sequence model $p_{\theta}$ to generate a potential outcomes table $\hat{\tau}_t$, then […]. Moreover, the data processing inequality implies that for any function $f$, […].

Figures (18)

  • Figure 1: Traditional implementations of active exploration algorithms such as Thompson sampling rely on probabilistic models over latent parameters that get updated as more data is gathered. Instead, we view the source of uncertainty in decision-making as missing data and use autoregressive generation as the basic unit of probabilistic inference.
  • Figure 2: Daily online decision-making task. The modern opportunity and challenge in this decision-making problem is that we can use an LLM to read the news articles to form a rich prior. The algorithm integrates the text information with user feedback it observes during the day.
  • Figure 3: Offline meta learning problem structure. Data is pooled across prior tasks and used to learn inputs or parameters of a policy which is then deployed, independently, to govern decisions within future tasks. Our algorithms use the offline data to train a sequence model on which policies depend.
  • Figure 4: Informed exploration must combine two distinct data modalities: unstructured prior information and observed outcomes of interactions. Our autoregressive approach rests on deep connections between optimized sequence models and probabilistic inference. To achieve good sequence prediction ("in-context" learning), the model must implicitly comprehend uncertainty in future outcomes: it must learn to initially rely on the foundation model "prior" based on environment features, and increasingly rely on past outcomes as they accumulate (prior washes out).
  • Figure 5: Posterior sampling via autoregressive generation (PSAR). The design of this figure reflects a motivating story in which actions correspond to newly released news articles, outcomes reflect the engagement levels of users in a certain subpopulation, and $Z$'s reflect article text. Figure a) depicts autoregressive generation of missing data in each row of the potential outcomes table $\tau$. Figure b) shows multiple autoregressive generations (imputations) of a single row. High (or low) variability across model generations indicates high (or low) uncertainty in average rewards. Low uncertainty may be due to prior information $Z^{(a)}$ being highly informative or due to revealed potential outcomes ($Y$'s) being informative.
  • ...and 13 more figures
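
The reading of Figure 5b, where variability across generations of a single row indicates uncertainty in the average reward, can be sketched as follows. This is an illustration under stated assumptions, not the paper's model: the sequence model is replaced by a hypothetical Beta-Bernoulli predictive, `prior_mean` stands in for what the article text $Z^{(a)}$ suggests, and `prior_strength` stands in for how informative that text is.

```python
import random
import statistics


def next_outcome_prob(total, count, prior_mean, prior_strength):
    """Hypothetical stand-in for the sequence model's next-outcome prediction."""
    return (prior_strength * prior_mean + total) / (prior_strength + count)


def generate_row_means(observed, row_length, prior_mean, prior_strength,
                       num_generations, seed=0):
    """Impute one row several times; return the mean of each completed row."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_generations):
        total, count = sum(observed), len(observed)
        while count < row_length:
            p = next_outcome_prob(total, count, prior_mean, prior_strength)
            y = 1 if rng.random() < p else 0
            total += y  # generated outcomes condition later generations
            count += 1
        means.append(total / count)
    return means


if __name__ == "__main__":
    settings = {
        "weak prior, nothing revealed": ([], 2.0),
        "strong prior, nothing revealed": ([], 200.0),
        "weak prior, 150 outcomes revealed": ([1, 0] * 75, 2.0),
    }
    for name, (observed, strength) in settings.items():
        means = generate_row_means(observed, 200, 0.5, strength, 200)
        print(name, round(statistics.stdev(means), 3))
```

Under these assumptions, the spread of the generated row means should be largest when the prior is weak and nothing has been revealed, and it should shrink either when the prior is strong or when many outcomes in the row have already been observed.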

Theorems & Definitions (22)

  • Remark 1: Relevance to modern recommender systems
  • Remark 2: Distinguishing epistemic and aleatoric uncertainty
  • Remark 3: What is known in theory
  • Proposition 1
  • Corollary 1: Exact Thompson sampling interpretation of Algorithm alg:Thompson
  • proof
  • Definition 1: Bandit simulator
  • Proposition 2
  • Corollary 2
  • proof: Proof sketch
  • ...and 12 more