Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis

Sharu Theresa Jose, Shana Moothedath

TL;DR

This work tackles stochastic contextual bandits with noisy contexts and an unknown channel parameter by developing a fully Bayesian Thompson Sampling algorithm that includes a de-noising step to form predictive posteriors over the true context. The authors derive information-theoretic Bayesian regret bounds showing $O(d\sqrt{T})$ growth in the Gaussian-linear setting and extend the analysis to delayed-context scenarios, where the regret scales favorably as $O(\sqrt{T m \log T})$ under suitable conditions. The approach hinges on expressing the regret in terms of KL-divergence between true and approximate posteriors and mutual information terms that capture the information gained about the channel and context. Empirical results on synthetic and real data demonstrate sublinear regret and competitive performance against oracle baselines, validating both the theoretical bounds and practical utility. Overall, the paper provides principled regret guarantees and a robust Bayesian TS framework for noisy-context contextual bandits with unknown noise channels and delayed feedback, with potential applicability to real-world recommendation and control systems.
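The de-noising idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a zero-mean Gaussian noise channel with a *known* covariance (the paper treats the channel parameter as unknown), replaces the predictive posterior over the true context by its mean, and uses hypothetical names throughout (`denoise_context`, `run_thompson_sampling`, `prior_var`, `reward_var`, and the randomly drawn matrices `G`).

```python
import numpy as np


def denoise_context(c_hat, mu_c, Sigma_c, Sigma_n):
    """Posterior mean of the true context c given c_hat = c + n.

    Assumes c ~ N(mu_c, Sigma_c) and n ~ N(0, Sigma_n) with Sigma_n known.
    """
    gain = Sigma_c @ np.linalg.inv(Sigma_c + Sigma_n)
    return mu_c + gain @ (c_hat - mu_c)


def run_thompson_sampling(T=200, d=3, K=5, prior_var=1.0, reward_var=1.0, seed=0):
    """Linear-Gaussian Thompson sampling on de-noised contexts (illustrative only)."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(K, d, d)) / np.sqrt(d)  # hypothetical per-action matrices
    theta_star = rng.normal(scale=np.sqrt(prior_var), size=d)  # drawn from the prior
    mu_c, Sigma_c = np.zeros(d), np.eye(d)
    Sigma_n = 0.5 * np.eye(d)  # assumed known here

    precision = np.eye(d) / prior_var  # posterior precision of theta
    b = np.zeros(d)                    # precision-weighted posterior mean
    rewards = np.empty(T)
    for t in range(T):
        # Environment: true context is hidden, only the noisy version is seen.
        c = rng.multivariate_normal(mu_c, Sigma_c)
        c_hat = c + rng.multivariate_normal(np.zeros(d), Sigma_n)

        # De-noising step: point estimate of the true context.
        c_bar = denoise_context(c_hat, mu_c, Sigma_c, Sigma_n)

        # Thompson step: sample theta from the current Gaussian posterior.
        cov = np.linalg.inv(precision)
        cov = 0.5 * (cov + cov.T)  # symmetrize against round-off
        theta_sample = rng.multivariate_normal(cov @ b, cov)

        features = G @ c_bar  # shape (K, d): feature G(a) c_bar for each action
        a = int(np.argmax(features @ theta_sample))

        # Reward depends on the true context, not on the de-noised one.
        r = theta_star @ (G[a] @ c) + rng.normal(scale=np.sqrt(reward_var))

        # Conjugate Gaussian update using the de-noised feature (an approximation).
        x = features[a]
        precision += np.outer(x, x) / reward_var
        b += x * r / reward_var
        rewards[t] = r

    return rewards, np.linalg.solve(precision, b)
```

Calling `run_thompson_sampling()` returns the reward sequence and the final posterior mean of the reward parameter. Updating the posterior with the de-noised feature rather than the true one is what makes the posterior approximate in this sketch.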

Abstract

We explore a stochastic contextual linear bandit problem where the agent observes a noisy, corrupted version of the true context through a noise channel with an unknown noise parameter. Our objective is to design an action policy that can approximate that of an oracle, which has access to the reward model, the channel parameter, and the predictive distribution of the true context from the observed noisy context. In a Bayesian framework, we introduce a Thompson sampling algorithm for Gaussian bandits with Gaussian context noise. Adopting an information-theoretic analysis, we bound the Bayesian regret of our algorithm with respect to the oracle's action policy. We also extend this problem to a scenario where the agent observes the true context with some delay after receiving the reward and show that delayed true contexts lead to lower Bayesian regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.

Paper Structure

This paper contains 38 sections, 8 theorems, 149 equations, 2 figures, 1 table, and 3 algorithms.

Key Result

Lemma 3.1

Under Assumption assum:1, the following upper bound holds if $\frac{\lambda}{\sigma^2}\leq \frac{d}{T} \leq 1$: [...], where $D_t =\mathbb{E}[D_{\rm KL}(P_t(\theta^{*})\Vert \bar{P}_t(\theta^{*}))]$ and [...]. In particular, if the feature map $\phi(a,c)=G(a)c$ with $G(a)$ being a $d \times d$ matrix satisfying Assumption assum:1 with $m=d$, [...]
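The quantity $D_t$ is an expected KL divergence between two posteriors over $\theta^{*}$. As a rough illustration only, and assuming both posteriors are multivariate Gaussians (an assumption made here for the sketch, not a statement taken from the lemma), the divergence inside the expectation has the standard closed form computed below. The function name `gaussian_kl` and its arguments are hypothetical.

```python
import numpy as np


def gaussian_kl(mu0, Sigma0, mu1, Sigma1):
    """KL( N(mu0, Sigma0) || N(mu1, Sigma1) ) for non-degenerate Gaussians."""
    k = mu0.shape[0]
    diff = mu1 - mu0
    Sigma1_inv = np.linalg.inv(Sigma1)
    # slogdet returns (sign, log|det|); the log-determinant is numerically stable.
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    return 0.5 * (
        np.trace(Sigma1_inv @ Sigma0)
        + diff @ Sigma1_inv @ diff
        - k
        + logdet1
        - logdet0
    )
```

For example, two unit-covariance Gaussians whose means differ by a unit vector give a divergence of 0.5, and the divergence is zero when the two posteriors coincide.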

Figures (2)

  • Figure 1: Comparison of the Bayesian regret of the proposed algorithms with baselines as a function of the number of iterations. (Left) Gaussian bandits with $K=40$, $\sigma^2_n=\sigma^2_{\gamma}=1.1$; (Center) Logistic bandits with $K=40$, $\sigma^2_n=2$, $\sigma^2_{\gamma}=2.5$; (Right) MovieLens dataset with added Gaussian context noise and Gaussian prior, with parameters set as $\sigma^2_n=0.1$, $\sigma^2_{\gamma}=0.6$.
  • Figure 2: Bayesian cumulative regret of Algorithm 1 as a function of the number of iterations, for varying numbers $K$ of actions.

Theorems & Definitions (11)

  • Lemma 3.1
  • Lemma 3.2
  • Theorem 3.1
  • Lemma 4.1
  • Lemma 4.2
  • Theorem 4.1
  • Definition A.1: Sub-Gaussian Random Variable
  • Lemma A.1: Change of Measure Inequality
  • proof
  • Lemma A.2
  • ...and 1 more