Thompson Sampling in Partially Observable Contextual Bandits

Hongju Park, Mohamad Kazem Shirani Faradonbeh

TL;DR

This work extends Thompson sampling to contextual bandits where contexts are only partially observed via a linear sensing process, introducing the transformed parameter $\boldsymbol{\eta}_i = D^{\top}\boldsymbol{\mu}_i$ and an observation model in which $y_i(t)$ equals $A x_i(t)$ plus an additive noise term. It derives high-probability, instance-dependent regret bounds and square-root consistency for the arm-specific parameter setting; the results are supported by novel martingale-based concentration inequalities tailored to partially observed dependent data. The paper also provides detailed proof outlines and comprehensive numerical experiments, including synthetic simulations and real healthcare datasets, illustrating near-optimal performance of Thompson sampling under partial observations. Overall, the approach broadens the applicability of Bayesian bandit strategies to settings with imperfect contextual information while preserving strong theoretical guarantees. The work also highlights problem-dependent information, margin conditions, and self-normalized processes as key components in achieving poly-log regret.

Abstract

Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn the arms of highest reward subject to contextual information, while the unknown reward parameters of each arm need to be learned by experimenting with that specific arm. Accordingly, a fundamental problem is that of balancing exploration (i.e., pulling different arms to learn their parameters) against exploitation (i.e., pulling the best arms to gain reward). To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on the observed data, which are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish the following: (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) scaling of the regret with other quantities including dimensions and number of arms. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling. To establish the results, we introduce novel martingale techniques and concentration inequalities to address partially observed dependent random variables generated from unspecified distributions, and also leverage problem-dependent information to sharpen probabilistic bounds for time-varying suboptimality gaps. These techniques pave the way towards studying other decision-making problems with contextual information as well as partial observations.
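
To make the setting concrete, the following is a minimal simulation sketch of Thompson sampling with arm-specific parameters when only a noisy linear function of the context is observed. It is not the paper's algorithm as stated. Several choices are assumptions made purely for illustration: standard Gaussian contexts and noise (under which the transformed parameter has the closed form used below), an identity prior precision, a unit noise scale in the posterior, and the hypothetical function name `run_thompson_sampling`. The key idea it illustrates is that the policy regresses rewards on the observation $y(t)$ rather than on the hidden context $x(t)$, so each arm's posterior targets the transformed parameter $D^{\top}\mu_i$.

```python
import numpy as np

def run_thompson_sampling(T=2000, N=3, d_x=4, d_y=6, noise_sd=0.5, seed=0):
    """Simulate Thompson sampling when only y(t) = A x(t) + noise is observed.

    All model choices (Gaussian contexts and noise, identity prior) are
    illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d_y, d_x))   # sensing matrix (hypothetical values)
    mu = rng.normal(size=(N, d_x))    # true arm parameters (hypothetical values)

    # Under x ~ N(0, I) and noise ~ N(0, noise_sd^2 I), E[x | y] = D y with
    # D = (A^T A / s^2 + I)^{-1} A^T / s^2, so E[reward_i | y] = y^T (D^T mu_i).
    s2 = noise_sd ** 2
    D = np.linalg.solve(A.T @ A / s2 + np.eye(d_x), A.T / s2)
    eta_true = mu @ D                 # row i is (D^T mu_i)^T, shape (N, d_y)

    # Per-arm Gaussian posterior: precision matrix B[i] and vector b[i].
    B = np.stack([np.eye(d_y) for _ in range(N)])
    b = np.zeros((N, d_y))
    pulls = np.zeros(N, dtype=int)
    regret = 0.0

    for _ in range(T):
        x = rng.normal(size=d_x)                      # unobserved context
        y = A @ x + noise_sd * rng.normal(size=d_y)   # observed signal

        # Draw one parameter sample per arm, then act greedily on the samples.
        scores = np.empty(N)
        for i in range(N):
            mean = np.linalg.solve(B[i], b[i])
            L = np.linalg.cholesky(B[i])              # B[i] = L L^T
            # mean + L^{-T} z has covariance B[i]^{-1}
            sample = mean + np.linalg.solve(L.T, rng.normal(size=d_y))
            scores[i] = y @ sample
        arm = int(np.argmax(scores))

        reward = x @ mu[arm] + noise_sd * rng.normal()

        # Update the posterior of the pulled arm only.
        B[arm] += np.outer(y, y)
        b[arm] += reward * y
        pulls[arm] += 1

        # Regret relative to an oracle that knows eta_true but sees only y.
        expected = eta_true @ y
        regret += expected.max() - expected[arm]

    eta_hat = np.stack([np.linalg.solve(B[i], b[i]) for i in range(N)])
    return eta_hat, eta_true, pulls, regret
```

Calling `run_thompson_sampling()` returns the per-arm estimates, the transformed parameters they target, the pull counts, and the cumulative regret against the observation-based oracle. Plotting the regret against $(\log t)^2$ or the scaled error $\sqrt{t}\,\|\widehat{\eta}_i(t)-\eta_i\|$ would be one way to compare such a simulation informally with the rates described in the abstract.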

Paper Structure

This paper contains 16 sections, 20 theorems, 199 equations, 6 figures, 2 algorithms.

Key Result

Theorem 1

Let $\eta_i$ and $\widehat{\eta}_i(t)$ be the transformed true parameter and its estimate, respectively. Then, with probability at least $1-\delta$, Algorithm 1 guarantees […] for all arms $i\in [N]$ and at all times $t$ in the range $\tau_i^{(1)} <t\leq T$, where $R = \sqrt{R_1^2 + R_2^2}$ and $\tau_i^{(1)}=\mathcal{O}(p_i^{-2}Nd_y^{3.5}\kappa^{-5}\log^{5}(TNd_y/\delta))$.

Figures (6)

  • Figure 1: Plots of $\mathrm{Regret}(t)/(\log t)^2$ over time for the different dimensions of context at $N = 5$ and $d_y=10,20,40,80$. The solid and dashed lines represent the average-case and worst-case regret curves, respectively.
  • Figure 2: Plots of normalized estimation errors $\sqrt{t}\|\widehat{\eta}_i(t)-\eta_i\|$ of Algorithm 1 over time for partially observable stochastic contextual bandits with five arm-specific parameters, observation dimension $d_y=20$, and context dimensions $d_x=10,~20,~40$.
  • Figure 3: Plots of regret over time with different numbers of arms $N = 10,~20,~30$ for Thompson sampling versus the Greedy algorithm. The solid and dashed lines represent the average-case and worst-case regret curves, respectively.
  • Figure 4: Plots of average correct decision rates of the regression oracle and Thompson sampling for the Eye Movement (left) and EEG (right) datasets.
  • Figure 5: Plots of average correct decision rates of the regression oracle and Thompson sampling for the Eye Movement (top left) and EEG (top right) datasets under the simple linear regression setup, and for the Eye Movement (bottom left) and EEG (bottom right) datasets under the logistic linear regression setup.
  • ...and 1 more figure

Theorems & Definitions (39)

  • Remark 1
  • Definition 1: Optimality Region
  • Remark 2
  • Theorem 1: Partial Estimation Accuracy
  • Theorem 2
  • Corollary 1
  • Remark 3
  • Corollary 2
  • Lemma 1
  • proof
  • ...and 29 more