Nearly Tight Bounds for Cross-Learning Contextual Bandits with Graphical Feedback
Ruiyuan Huang, Zengfeng Huang
TL;DR
This work studies cross-learning contextual bandits with graphical feedback, where pulling an arm reveals the losses of its neighbors in a feedback graph across all contexts. The central question is whether one can attain $\widetilde{O}(\sqrt{\alpha T})$ regret, where $\alpha$ is the independence number of the feedback graph, independent of the number of contexts. The authors design algorithms that exploit the stochasticity of contexts and apply adversarial-bandit techniques within a cross-learning, graph-aware framework, achieving $\widetilde{O}(\sqrt{\alpha T})$ regret under adversarial losses and stochastic contexts, and extend the result to the unknown-context-distribution case via epoch-based loss estimation. The main technical contributions include a per-context regret decomposition, a known-distribution analysis using importance weighting and FTRL, and a novel unknown-distribution algorithm with empirical importance estimation and rejection sampling, supported by concentration lemmas extended to the graphical setting. The result is nearly minimax-optimal for cross-learning contextual bandits with graphical feedback, clarifies the role of feedback structure in contextual bandits, and is relevant to applications with graphical feedback such as bidding in first-price auctions.
Abstract
The cross-learning contextual bandit problem with graphical feedback has recently attracted significant attention. In this setting, there is a contextual bandit with a feedback graph over the arms, and pulling an arm reveals the loss for all neighboring arms in the feedback graph across all contexts. Initially proposed by Han et al. (2024), this problem has broad applications in areas such as bidding in first-price auctions, and explores a novel frontier in the feedback structure of bandit problems. A key theoretical question is whether an algorithm with $\widetilde{O}(\sqrt{\alpha T})$ regret exists, where $\alpha$ represents the independence number of the feedback graph. This question is particularly interesting because it concerns whether an algorithm can achieve a regret bound entirely independent of the number of contexts and matching the minimax regret of vanilla graphical bandits. Previous work has demonstrated that such an algorithm is impossible for adversarial contexts, but the question remains open for stochastic contexts. In this work, we affirmatively answer this open question by presenting an algorithm that achieves the minimax $\widetilde{O}(\sqrt{\alpha T})$ regret for cross-learning contextual bandits with graphical feedback and stochastic contexts. Notably, although that question is open even for stochastic bandits, we directly solve the strictly stronger adversarial bandit version of the problem.
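To make the feedback model concrete, the following is a minimal sketch of an importance-weighted loss estimate under graphical, cross-context feedback with a known context distribution. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the toy graph, and the numbers are all hypothetical, every arm is assumed to reveal its own loss (self-loops), and the policy is assumed to be fixed before the round's context is drawn. Because an arm's loss is revealed in every context at once, the estimate divides by the probability of observing that arm averaged over the context distribution, which is what keeps it unbiased.

```python
def observation_prob(policy, context_dist, adj):
    """Probability that each arm's loss is revealed in a round.

    policy[c][i]    : probability of pulling arm i in context c
    context_dist[c] : (assumed known) probability of context c
    adj[i][j]       : True if pulling arm i reveals the loss of arm j
    """
    n_contexts, n_arms = len(context_dist), len(adj)
    # Marginal probability of pulling each arm, averaged over contexts.
    marginal = [
        sum(context_dist[c] * policy[c][i] for c in range(n_contexts))
        for i in range(n_arms)
    ]
    # Arm j is observed whenever some arm i with an edge i -> j is pulled.
    return [
        sum(marginal[i] for i in range(n_arms) if adj[i][j])
        for j in range(n_arms)
    ]


def iw_estimate(losses, pulled_arm, policy, context_dist, adj):
    """Importance-weighted loss estimates for every (context, arm) pair.

    losses[c][j] is the loss of arm j in context c; only entries revealed
    by pulled_arm are used, all other entries are estimated as 0.
    """
    obs = observation_prob(policy, context_dist, adj)
    return [
        [row[j] / obs[j] if adj[pulled_arm][j] else 0.0 for j in range(len(row))]
        for row in losses
    ]


# Hypothetical toy instance: 2 contexts, 3 arms on a path graph with self-loops.
adj = [
    [True, True, False],
    [True, True, True],
    [False, True, True],
]
context_dist = [0.4, 0.6]
policy = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
losses = [[0.2, 0.9, 0.5], [0.7, 0.1, 0.4]]

# Estimates after pulling arm 1, which reveals all three arms in both contexts.
estimate = iw_estimate(losses, 1, policy, context_dist, adj)
```

Averaging `iw_estimate` over the random context and the random arm recovers `losses` exactly in this toy instance. The sketch does not cover the case of an unknown context distribution, where the observation probabilities must themselves be estimated.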
