Stochastic Graph Bandit Learning with Side-Observations
Xueping Gong, Jiheng Zhang
TL;DR
This work addresses stochastic contextual bandits with general function classes under time-varying directed graph feedback. It introduces Ada-G, a practical algorithm that adapts to reward gaps and graph structure without requiring knowledge of graph parameters. The paper establishes gap-dependent regret bounds in terms of the policy disagreement coefficient and graph parameters based on the independence number, and proves minimax optimality up to logarithmic factors. The method relies on offline regression oracles, epoch-based confidence sets, and a graph-aware exploration scheme that uses induced subgraphs to bound exploration by the independence number, so that it performs well on both easy and hard instances. Empirical results on synthetic graphs and a real-world network (Flixster) show substantial regret reductions and support the framework's practicality and robustness in leveraging side-observations for contextual decision-making.
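To make the idea of graph-aware exploration on an induced subgraph concrete, here is a minimal Python sketch. It is not the Ada-G procedure itself. It is a generic greedy cover that picks a small set of actions whose side-observations reveal the reward of every candidate action. The function name `greedy_exploration_set`, the adjacency format `out_neighbors` (action mapped to the actions it reveals), and the assumption that every action observes its own reward are illustrative choices, not details taken from the paper.

```python
def greedy_exploration_set(candidates, out_neighbors):
    """Greedily choose actions whose side-observations cover all candidates.

    candidates:    iterable of sortable action ids still under consideration
    out_neighbors: dict mapping an action to the actions it reveals when played
                   (hypothetical format; self-observation is assumed)
    """
    candidates = set(candidates)
    # Restrict the feedback graph to the induced subgraph on the candidates,
    # and assume each action always observes its own reward.
    observes = {
        a: (set(out_neighbors.get(a, ())) & candidates) | {a}
        for a in candidates
    }
    uncovered = set(candidates)
    chosen = []
    while uncovered:
        # Pick the action revealing the most still-uncovered candidates;
        # sorting makes tie-breaking deterministic (smallest id wins).
        best = max(sorted(candidates), key=lambda a: len(observes[a] & uncovered))
        chosen.append(best)
        uncovered -= observes[best]
    return chosen


if __name__ == "__main__":
    # Toy directed feedback graph: playing 0 reveals 1 and 2, and so on.
    graph = {0: [1, 2], 1: [2], 2: [], 3: [0]}
    print(greedy_exploration_set([0, 1, 2, 3], graph))
    # After action 0 is eliminated, only edges among {1, 2, 3} remain.
    print(greedy_exploration_set([1, 2, 3], graph))
```

Playing only the chosen actions during exploration rounds is enough to observe every candidate, which is why the exploration cost can scale with a graph quantity rather than with the number of actions. Restricting to the induced subgraph matters because eliminating actions can remove edges that the cover previously relied on.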
Abstract
In this paper, we investigate stochastic contextual bandits with a general function space and graph feedback. We propose an algorithm that addresses this problem by adapting to both the underlying graph structure and the reward gaps. To the best of our knowledge, our algorithm is the first to provide a gap-dependent upper bound in this stochastic setting, bridging the research gap left by [35]. Compared with [31,33,35], our method offers improved regret upper bounds and does not require knowledge of graphical quantities. We conduct numerical experiments to demonstrate the computational efficiency and effectiveness of our approach in terms of regret upper bounds. These findings highlight the significance of our algorithm in advancing the field of stochastic contextual bandits with graph feedback and open up avenues for practical applications in various domains.
