Active Exploration via Autoregressive Generation of Missing Data
Tiffany Tianhui Cai, Hongseok Namkoong, Daniel Russo, Kelly W Zhang
TL;DR
This paper reframes uncertainty in active exploration as uncertainty over missing future outcomes, and replaces latent-parameter posterior sampling with autoregressive generation of those outcomes. It introduces an offline/online pipeline: a sequence model is trained offline on past tasks together with their unstructured prior information Z (such as text), and is then used online to generate missing outcomes. This enables Thompson sampling in which beliefs are updated through in-context learning, without retraining. The authors prove that offline next-outcome prediction quality governs online posterior-sampling accuracy, and they derive regret bounds that tie online performance to offline prediction loss, giving a principled reduction from online decision-making to offline sequence modeling. Empirical results in synthetic and semi-synthetic news-recommendation settings show accurate uncertainty quantification and low regret when text-based prior information is used, illustrating the practical potential of foundation-model information for informed exploration. Overall, the work offers a scalable, principled bridge between modern sequence modeling and adaptive decision-making, with implications for recommender systems and other interactive learning tasks.
Abstract
We pose uncertainty quantification and exploration in online decision-making as a problem of training and generation from an autoregressive sequence model, an area experiencing rapid innovation. Our approach rests on viewing uncertainty as arising from missing future outcomes that would be revealed through appropriate action choices, rather than from unobservable latent parameters of the environment. This reformulation aligns naturally with modern machine learning capabilities: we can i) train generative models through next-outcome prediction rather than fit explicit priors, ii) assess uncertainty through autoregressive generation rather than parameter sampling, and iii) adapt to new information through in-context learning rather than explicit posterior updating. To showcase these ideas, we formulate a challenging meta-bandit problem where effective performance requires leveraging unstructured prior information (like text features) while exploring judiciously to resolve key remaining uncertainties. We validate our approach through both theory and experiments. Our theory establishes a reduction, showing success at offline next-outcome prediction translates to reliable online uncertainty quantification and decision-making, even with strategically collected data. Semi-synthetic experiments show our insights bear out in a news-article recommendation task, where article text can be leveraged to minimize exploration.
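
The idea of assessing uncertainty by generating missing outcomes, rather than by sampling latent parameters, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes binary rewards and replaces the trained sequence model with a hypothetical stand-in, `predict_next`, that uses a Beta-Bernoulli (Pólya urn) predictive rule and ignores the prior information Z. All function names and the `alpha`/`beta` parameters are illustrative assumptions.

```python
import random


def predict_next(observed, alpha=1.0, beta=1.0):
    """Hypothetical stand-in for a trained sequence model.

    Returns P(next outcome = 1 | outcomes so far) using a Beta-Bernoulli
    predictive rule. A learned model would also condition on Z.
    """
    return (alpha + sum(observed)) / (alpha + beta + len(observed))


def impute_mean(observed, horizon, rng):
    """Autoregressively generate the missing outcomes for one arm,
    then return the mean of the completed outcome sequence."""
    seq = list(observed)
    while len(seq) < horizon:
        p = predict_next(seq)  # condition on observed and generated outcomes
        seq.append(1 if rng.random() < p else 0)
    return sum(seq) / len(seq)


def choose_arm(history, horizon, rng):
    """Pick the arm whose imputed mean reward is highest.

    history maps each arm to its list of observed binary outcomes.
    """
    means = {arm: impute_mean(obs, horizon, rng) for arm, obs in history.items()}
    return max(means, key=means.get)


def run_bandit(true_means, horizon, seed=0):
    """Simulate the decision loop on a Bernoulli bandit (illustration only)."""
    rng = random.Random(seed)
    history = {arm: [] for arm in range(len(true_means))}
    for _ in range(horizon):
        arm = choose_arm(history, horizon, rng)
        reward = 1 if rng.random() < true_means[arm] else 0
        history[arm].append(reward)  # new outcome simply extends the context
    return history


if __name__ == "__main__":
    result = run_bandit([0.2, 0.5, 0.8], horizon=200)
    print({arm: len(obs) for arm, obs in result.items()})
```

In this sketch, updating beliefs amounts to appending the new outcome to the arm's observed sequence, which mirrors the role in-context learning plays in the abstract. Each call to `choose_arm` draws a fresh completion of the missing outcomes, so arms with few observations produce more variable imputed means and are explored more often.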
