Stochastic Graph Bandit Learning with Side-Observations

Xueping Gong, Jiheng Zhang

TL;DR

This work addresses stochastic contextual bandits with general function classes under time-varying directed graph feedback. It introduces Ada-G, a practical algorithm that adapts to reward gaps and graph structures without requiring knowledge of graph parameters. It establishes gap-dependent regret bounds in terms of the policy disagreement coefficient and graph parameters based on the independence number, and proves minimax optimality up to logarithmic factors. The method relies on offline regression oracles, epoch-based confidence sets, and a graph-aware exploration scheme that uses induced subgraphs to bound exploration by the independence number, so that it performs efficiently on both easy and hard instances. Empirical results on synthetic graphs and a real-world network (Flixster) show substantial regret reductions and support the framework's practicality and robustness in leveraging side-observations for contextual decision-making.

Abstract

In this paper, we investigate the stochastic contextual bandit problem with a general function space and graph feedback. We propose an algorithm that addresses this problem by adapting to both the underlying graph structures and the reward gaps. To the best of our knowledge, our algorithm is the first to provide a gap-dependent upper bound in this stochastic setting, bridging the research gap left by the work in [35]. Compared with [31,33,35], our method offers improved regret upper bounds and does not require knowledge of graphical quantities. We conduct numerical experiments to demonstrate the computational efficiency of our approach and its effectiveness in terms of regret. These findings highlight the significance of our algorithm in advancing the field of stochastic contextual bandits with graph feedback, opening up avenues for practical applications in various domains.
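To make the notion of graph feedback concrete, the following minimal Python sketch shows how playing one arm can reveal the rewards of several arms. It is an illustration only, not the paper's algorithm: the function name `observe`, the 4-arm graph, and the reward values are hypothetical, and the sketch assumes that playing an arm always reveals its own reward in addition to the rewards of its out-neighbors.

```python
def observe(graph, rewards, arm):
    """Return the rewards revealed by playing `arm`.

    Assumption for this sketch: the played arm reveals its own reward
    plus the reward of every out-neighbor in the directed feedback graph.
    """
    revealed = {arm} | set(graph.get(arm, ()))
    return {a: rewards[a] for a in sorted(revealed)}


# Hypothetical directed feedback graph: arm -> list of out-neighbors.
graph = {0: [1, 2], 1: [2], 2: [], 3: [0]}
# Hypothetical realized rewards for one round.
rewards = {0: 0.3, 1: 0.9, 2: 0.5, 3: 0.1}

# Playing arm 0 yields three observations for the price of one pull.
feedback = observe(graph, rewards, arm=0)
```

In this toy round, arm 0 gives information about arms 1 and 2 as well, whereas arm 2 reveals only itself. A learner that knows the graph for the round can therefore gather side-observations by choosing well-connected arms.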

Paper Structure

This paper contains 16 sections, 11 theorems, 61 equations, 5 figures, 1 table, and 3 algorithms.

Key Result

Lemma 3.1

(Implicit Optimization Problem). For every epoch $m$ and every round $t$ in epoch $m$, $Q_t$ is a feasible solution to the following implicit optimization problem:

Figures (5)

  • Figure 1: An example of a k-tree.
  • Figure 2: Comparison of our algorithms with baselines in the graph feedback setting. We conduct numerical experiments on k-tree, clique-group, and random graphs, respectively. The upper and lower dashed curves show the regret curve of the same color plus and minus one standard deviation, respectively.
  • Figure 3: Regret curves on Flixster social network.
  • Figure 4: Different types of graphs in order: a fully connected graph, a clique group, a k-tree, a random graph.
  • Figure 5: Regret curves on clique groups.
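The graph families named in Figure 4 differ sharply in their independence number, the quantity that the exploration scheme is said to depend on. The Python sketch below is a hedged illustration for tiny undirected graphs, not code from the paper. The helper names are hypothetical, and it assumes that a "clique group" is a disjoint union of equally sized cliques, which may differ from the paper's exact construction.

```python
from itertools import combinations


def complete_graph(n):
    """Edge set of a fully connected graph on nodes 0..n-1."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)}


def clique_group(num_cliques, clique_size):
    """Edge set of disjoint cliques (assumed meaning of 'clique group')."""
    edges = set()
    for c in range(num_cliques):
        base = c * clique_size
        for i in range(clique_size):
            for j in range(i + 1, clique_size):
                edges.add((base + i, base + j))
    return edges


def independence_number(n, edges):
    """Size of the largest set of pairwise non-adjacent nodes.

    Brute force over all subsets, so only suitable for very small n.
    Edges are stored as (u, v) with u < v, matching combinations().
    """
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all((u, v) not in edges for u, v in combinations(subset, 2)):
                return size
    return 0


alpha_complete = independence_number(6, complete_graph(6))
alpha_cliques = independence_number(6, clique_group(3, 2))
```

With six nodes, the complete graph has independence number 1, while three disjoint cliques of size two have independence number 3. Under the assumption above, one pull per clique is enough to observe every arm, which is the intuition behind exploration costs that scale with the independence number rather than with the number of arms.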

Theorems & Definitions (13)

  • Definition 2.1
  • Remark 3.1
  • Lemma 3.1
  • Theorem 3.1
  • Theorem 3.2
  • Lemma B.1: CBwithOracleinstanceCB_RL
  • Proposition B.1
  • Lemma B.2
  • Lemma B.3
  • Lemma B.4
  • ...and 3 more