Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis
Sharu Theresa Jose, Shana Moothedath
TL;DR
This work tackles stochastic contextual bandits with noisy contexts and an unknown channel parameter by developing a fully Bayesian Thompson Sampling algorithm that includes a de-noising step to form predictive posteriors over the true context. The authors derive information-theoretic Bayesian regret bounds that grow as $O(d\sqrt{T})$ in the Gaussian-linear setting, and extend the analysis to delayed-context scenarios, where the regret scales as $O(\sqrt{T m \log T})$ under suitable conditions. The analysis expresses the regret in terms of the KL divergence between the true and approximate posteriors, together with mutual information terms that capture the information gained about the channel and the context. Empirical results on synthetic and real data show sublinear regret and competitive performance against oracle baselines, supporting both the theoretical bounds and the practical utility of the method. Overall, the paper provides principled regret guarantees and a Bayesian TS framework for contextual bandits with noisy contexts, unknown noise channels, and delayed true contexts, with potential applicability to real-world recommendation and control systems.
Abstract
We explore a stochastic contextual linear bandit problem where the agent observes a noisy, corrupted version of the true context through a noise channel with an unknown noise parameter. Our objective is to design an action policy that can approximate that of an oracle, which has access to the reward model, the channel parameter, and the predictive distribution of the true context given the observed noisy context. In a Bayesian framework, we introduce a Thompson sampling algorithm for Gaussian bandits with Gaussian context noise. Adopting an information-theoretic analysis, we bound the Bayesian regret of our algorithm with respect to the oracle's action policy. We also extend this problem to a scenario where the agent observes the true context with some delay after receiving the reward, and show that delayed true contexts lead to lower Bayesian regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.
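To make the setting concrete, the following is a minimal sketch of Thompson sampling with a de-noising step in a Gaussian-linear bandit. It is an illustration under simplifying assumptions, not the authors' algorithm: the noise channel is taken to be known (the paper treats its parameter as unknown), the feature map `phi(a, c) = A[a] @ c` and all names (`denoise`, `run_ts`, `A`, `Sigma_n`) are hypothetical, and the posterior over the reward parameter is updated with the de-noised mean feature as an approximation. Regret is measured against an oracle that knows the reward parameter but sees only the noisy context.

```python
import numpy as np

def denoise(c_hat, mu_c, Sigma_c, Sigma_n):
    """Gaussian posterior of the true context c given c_hat = c + noise."""
    gain = Sigma_c @ np.linalg.inv(Sigma_c + Sigma_n)
    mean = mu_c + gain @ (c_hat - mu_c)
    cov = Sigma_c - gain @ Sigma_c
    return mean, cov

def run_ts(T=500, d=3, n_actions=4, sigma_r=0.1, seed=0):
    """Return cumulative regret of the sketch relative to the oracle policy."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)            # reward parameter, unknown to the agent
    A = rng.normal(size=(n_actions, d, d))     # hypothetical feature maps
    mu_c, Sigma_c = np.zeros(d), np.eye(d)     # context prior (assumed known)
    Sigma_n = 0.5 * np.eye(d)                  # channel noise (assumed known here)

    precision = np.eye(d)                      # N(0, I) prior on theta
    b = np.zeros(d)
    regret = np.zeros(T)
    total = 0.0
    for t in range(T):
        # Environment draws a true context; the agent sees only a noisy copy.
        c = rng.multivariate_normal(mu_c, Sigma_c)
        c_hat = c + rng.multivariate_normal(np.zeros(d), Sigma_n)

        # De-noising step: expected feature of each action given c_hat.
        c_mean, _ = denoise(c_hat, mu_c, Sigma_c, Sigma_n)
        feats = A @ c_mean                     # shape (n_actions, d)

        # Thompson step: sample theta from the current Gaussian posterior.
        cov = np.linalg.inv(precision)
        cov = (cov + cov.T) / 2                # symmetrize for numerical safety
        theta_t = rng.multivariate_normal(cov @ b, cov)
        a = int(np.argmax(feats @ theta_t))

        # Oracle: same de-noised features, but uses the true theta_star.
        oracle_vals = feats @ theta_star
        a_oracle = int(np.argmax(oracle_vals))

        # Reward depends on the true (unobserved) context.
        r = theta_star @ (A[a] @ c) + sigma_r * rng.normal()

        # Approximate conjugate update using the de-noised mean feature.
        x = feats[a]
        precision += np.outer(x, x) / sigma_r**2
        b += x * r / sigma_r**2

        total += oracle_vals[a_oracle] - oracle_vals[a]
        regret[t] = total
    return regret
```

Calling `run_ts(T=1000)` returns the cumulative regret curve, which can be plotted against `T` to inspect its growth. Because the oracle maximizes the same de-noised objective, the per-round regret in this sketch is nonnegative by construction.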
