Contextual Thompson Sampling via Generation of Missing Data
Kelly W. Zhang, Tiffany Tianhui Cai, Hongseok Namkoong, Daniel Russo
TL;DR
This work reframes uncertainty in contextual bandits as arising from missing but potentially observable outcomes, and uses offline-trained generative models to impute those outcomes at decision time. At each step the algorithm samples a complete task dataset from the learned model, fits an oracle policy on the imputed data, and uses that policy to select an action, yielding a generative implementation of Thompson Sampling. Theoretical contributions include regret bounds that depend on the generative model only through its offline sequence-model loss and that extend to infinite policy classes via VC/Natarajan-dimension arguments, with an additional misspecification penalty $\sqrt{2(\ell(p_\theta)-\ell(p^*))}$ when the imputation model is approximate. Empirically, the method (TS-Gen) outperforms baselines on synthetic and semi-synthetic tasks, and lower offline loss is shown to correspond to lower online regret, illustrating the practical viability of uncertainty-aware decision-making with modern generative models. The framework supports meta-learning across tasks and can accommodate constraints such as fairness, opening avenues for applying generative imputation to more complex decision problems.
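To give a sense of scale for the misspecification penalty quoted above, the following minimal sketch evaluates $\sqrt{2(\ell(p_\theta)-\ell(p^*))}$ for given loss values. The function name and the example numbers are hypothetical, and the sketch assumes that $p^*$ attains the smallest loss so that the gap is non-negative. How the term enters the full regret bound is not specified in this summary and is not modeled here.

```python
import math


def misspecification_penalty(loss_model: float, loss_optimal: float) -> float:
    """Evaluate sqrt(2 * (l(p_theta) - l(p*))) for two offline loss values.

    Hypothetical helper for illustration only; it assumes the loss gap
    is non-negative, i.e. p* is the loss-minimizing sequence model.
    """
    gap = loss_model - loss_optimal
    if gap < 0:
        raise ValueError("expected loss_model >= loss_optimal")
    return math.sqrt(2.0 * gap)


if __name__ == "__main__":
    # Illustrative numbers only: shrinking the offline loss gap shrinks the penalty.
    for gap in (0.5, 0.02, 0.0):
        print(gap, misspecification_penalty(1.0 + gap, 1.0))
```

Because of the square root, the penalty shrinks more slowly than the loss gap itself: reducing the gap by a factor of 100 reduces the penalty only by a factor of 10.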
Abstract
We introduce a framework for Thompson sampling (TS) contextual bandit algorithms, in which the algorithm's ability to quantify uncertainty and make decisions depends on the quality of a generative model that is learned offline. Instead of viewing uncertainty in the environment as arising from unobservable latent parameters, our algorithm treats uncertainty as stemming from missing but potentially observable outcomes (including both future and counterfactual outcomes). If these outcomes were all observed, one could simply make decisions using an "oracle" policy fit on the complete dataset. Inspired by this conceptualization, at each decision time our algorithm uses a generative model to probabilistically impute the missing outcomes, fits a policy on the imputed complete dataset, and uses that policy to select the next action. We formally show that this algorithm is a generative formulation of TS and establish a state-of-the-art regret bound. Notably, our regret bound depends on the generative model only through the quality of its offline prediction loss, and applies to any method of fitting the "oracle" policy.
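The impute, fit, and act loop described in the abstract can be sketched on a toy problem. The code below is an illustrative sketch under strong simplifying assumptions, not the authors' implementation: contexts are a small discrete set, rewards are binary, and the offline-trained generative model is replaced by a hand-written stand-in (a Beta-Bernoulli posterior predictive sampled one outcome at a time). All function names, the table layout `history[x][a]`, and the prior parameters `a0`, `b0` are hypothetical.

```python
import numpy as np


def impute_missing(observed, n_total, rng, a0=1.0, b0=1.0):
    """Complete one (context, action) cell to n_total outcomes.

    Stand-in for the learned sequence model: each missing binary outcome
    is sampled conditional on the observed and previously imputed ones.
    """
    completed = list(observed)
    ones = float(sum(completed))
    count = len(completed)
    for _ in range(n_total - count):
        p_next = (a0 + ones) / (a0 + b0 + count)  # predictive P(next outcome = 1)
        y = int(rng.random() < p_next)
        completed.append(y)
        ones += y
        count += 1
    return completed


def fit_oracle_policy(completed):
    """Fit the "oracle" policy on a complete table: best mean outcome per context."""
    means = np.array([[np.mean(cell) for cell in row] for row in completed])
    return means.argmax(axis=1)


def ts_gen_step(history, context, n_total, rng):
    """One decision: impute all missing outcomes, fit a policy, then act."""
    completed = [[impute_missing(cell, n_total, rng) for cell in row] for row in history]
    policy = fit_oracle_policy(completed)
    return int(policy[context])


def run(true_means, horizon, seed=0):
    """Simulate the loop on a toy Bernoulli bandit and return cumulative regret."""
    rng = np.random.default_rng(seed)
    n_contexts, n_actions = true_means.shape
    history = [[[] for _ in range(n_actions)] for _ in range(n_contexts)]
    regret = 0.0
    for _ in range(horizon):
        x = int(rng.integers(n_contexts))
        a = ts_gen_step(history, x, horizon, rng)
        y = int(rng.random() < true_means[x, a])
        history[x][a].append(y)  # only the chosen action's outcome is revealed
        regret += true_means[x].max() - true_means[x, a]
    return regret


if __name__ == "__main__":
    toy_means = np.array([[0.9, 0.1], [0.2, 0.8]])  # made-up success probabilities
    print("cumulative regret:", run(toy_means, horizon=100))
```

Randomness enters only through the imputation step: cells with few observations produce widely varying completed tables, so the fitted policy varies across draws and the loop explores, while well-observed cells yield nearly identical completions. In the framework described above, the hand-written sampler would be replaced by a generative model trained offline, and `fit_oracle_policy` by any policy-fitting method.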
