Contextual Thompson Sampling via Generation of Missing Data

Kelly W. Zhang, Tiffany Tianhui Cai, Hongseok Namkoong, Daniel Russo

TL;DR

This work reframes contextual bandits by treating uncertainty as missing, potentially observable outcomes and leverages offline-trained generative models to impute these outcomes at decision time. The algorithm samples a complete task dataset from a learned distribution, fits an oracle policy on the imputed data, and uses that policy for action selection, yielding a generative implementation of Thompson Sampling. Theoretical contributions include regret bounds that depend only on the offline sequence-model loss and extend to infinite policy classes via VC/Natarajan-dimension arguments, with an additional misspecification penalty $\sqrt{2(\ell(p_\theta)-\ell(p^*))}$ when the imputation model is approximate. Empirically, the method (TS-Gen) outperforms baselines in synthetic and semi-synthetic tasks, with a demonstrated relationship between offline loss and online regret, illustrating practical viability for uncertainty-aware decision-making with modern generative models. The framework enables flexible meta-learning across tasks and can accommodate constraints such as fairness, opening avenues for applying generative imputation in more complex decision problems.
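To give a feel for the size of the misspecification penalty $\sqrt{2(\ell(p_\theta)-\ell(p^*))}$, the sketch below evaluates it for a few hypothetical offline loss gaps. The function name `misspecification_penalty` and the numeric losses are illustrative assumptions, not values from the paper, and the losses are assumed to be measured in the same units (e.g. nats).

```python
import math

def misspecification_penalty(loss_model: float, loss_optimal: float) -> float:
    """Evaluate sqrt(2 * (l(p_theta) - l(p*))) for two offline losses.

    Hypothetical helper: the excess loss is assumed non-negative,
    since p* attains the best achievable offline loss.
    """
    excess = loss_model - loss_optimal
    if excess < 0:
        raise ValueError("loss_model should not be below loss_optimal")
    return math.sqrt(2.0 * excess)

# Illustrative (made-up) excess losses: the penalty shrinks with the
# square root of the gap, so halving the gap does not halve the penalty.
for gap in (0.08, 0.02, 0.0):
    print(gap, misspecification_penalty(1.0 + gap, 1.0))
```

Because the term depends only on the offline loss gap, it vanishes when the learned model matches $p^*$ in offline loss.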

Abstract

We introduce a framework for Thompson sampling (TS) contextual bandit algorithms, in which the algorithm's ability to quantify uncertainty and make decisions depends on the quality of a generative model that is learned offline. Instead of viewing uncertainty in the environment as arising from unobservable latent parameters, our algorithm treats uncertainty as stemming from missing, but potentially observable outcomes (including both future and counterfactual outcomes). If these outcomes were all observed, one could simply make decisions using an "oracle" policy fit on the complete dataset. Inspired by this conceptualization, at each decision-time, our algorithm uses a generative model to probabilistically impute missing outcomes, fits a policy using the imputed complete dataset, and uses that policy to select the next action. We formally show that this algorithm is a generative formulation of TS and establish a state-of-the-art regret bound. Notably, our regret bound depends on the generative model only through the quality of its offline prediction loss, and applies to any method of fitting the "oracle" policy.
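The impute, fit, and act loop described above can be sketched on a toy problem. This is a minimal illustration under strong simplifying assumptions, not the authors' implementation: it drops contexts (a plain Bernoulli bandit), replaces the offline-learned generative model with a Beta-Bernoulli posterior predictive that generates missing outcomes autoregressively, and takes the "oracle" policy to be the arm with the highest mean in the completed table. All function names and parameters (`impute_column`, `generative_ts_step`, `run`, `n_total`) are hypothetical.

```python
import numpy as np

def impute_column(observed, n_total, rng, a0=1.0, b0=1.0):
    """Autoregressively fill one arm's outcome column up to n_total entries.

    Assumption: a Beta(a0, b0)-Bernoulli posterior predictive stands in
    for the offline-trained sequence model.
    """
    column = list(observed)
    successes = sum(column)
    while len(column) < n_total:
        # Predictive probability of the next outcome given everything so far,
        # including outcomes that were themselves generated.
        p_next = (a0 + successes) / (a0 + b0 + len(column))
        y = int(rng.random() < p_next)
        column.append(y)
        successes += y
    return column

def generative_ts_step(history, n_arms, n_total, rng):
    """Impute every arm's missing outcomes, then act on the completed table."""
    imputed_means = [
        np.mean(impute_column(history[arm], n_total, rng))
        for arm in range(n_arms)
    ]
    # "Oracle" policy fit on the imputed complete dataset: best arm by mean.
    return int(np.argmax(imputed_means))

def run(true_means, horizon, n_total=300, seed=0):
    """Simulate the loop; returns how often each arm was selected."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    history = {arm: [] for arm in range(n_arms)}
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(horizon):
        arm = generative_ts_step(history, n_arms, n_total, rng)
        reward = int(rng.random() < true_means[arm])
        history[arm].append(reward)  # only the chosen arm's outcome is revealed
        pulls[arm] += 1
    return pulls

pulls = run(true_means=[0.2, 0.8], horizon=200)
print(pulls)
```

Randomness enters only through the imputation step, so exploration comes from disagreement between imputed tables rather than from sampling an explicit latent parameter.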

Paper Structure

This paper contains 66 sections, 9 theorems, 44 equations, 12 figures, 4 algorithms.

Key Result

Proposition 1

Algorithm \ref{alg:Thompson} with imputation model $p^*$ implements Thompson Sampling (probability matching), i.e., the following holds almost surely:

Figures (12)

  • Figure 1: Potential outcomes table for a task $\tau$.
  • Figure 2: News recommendation meta contextual bandit problem.
  • Figure 3: The agent imputes missing outcomes and uses the imputed dataset to fit a policy.
  • Figure 4: Offline meta-learning and online decision-making across multiple tasks.
  • Figure 5: Posterior sampling via autoregressive generation (Algorithm \ref{alg:posterior_sample}).
  • ...and 7 more figures

Theorems & Definitions (18)

  • Proposition 1: Algorithm \ref{alg:Thompson} Implements Thompson Sampling
  • Theorem 1: Regret bound for Generative TS with a perfectly calibrated imputation model $p^*$
  • Proposition 2: Complexity bound on entropy
  • Theorem 2: Regret bound for Generative TS with an approximate imputation model
  • proof
  • proof
  • Lemma 1: Decomposing loss under $p_\theta$
  • proof
  • Lemma 2: KL Divergence in next action distribution
  • proof
  • ...and 8 more