Nearly Tight Bounds for Cross-Learning Contextual Bandits with Graphical Feedback

Ruiyuan Huang, Zengfeng Huang

TL;DR

This work studies cross-learning contextual bandits with graphical feedback, where pulling an arm reveals the losses of its neighbors in a feedback graph across all contexts. The central question is whether one can attain $\tilde{O}(\sqrt{\alpha T})$ regret, independent of the number of contexts, where $\alpha$ is the independence number of the feedback graph. The authors design algorithms that exploit the stochasticity of contexts and apply adversarial-bandit techniques within a cross-learning, graph-aware framework, achieving $\tilde{O}(\sqrt{\alpha T})$ regret under adversarial losses and stochastic contexts, and they extend the result to an unknown context distribution via epoch-based loss estimation. The main technical contributions include a per-context regret decomposition, a known-distribution analysis using importance weighting and FTRL, and a novel unknown-distribution algorithm with empirical importance estimation and rejection sampling, supported by extended concentration lemmas in the graphical setting. This yields nearly minimax-optimal performance for cross-learning contextual bandits with graphical feedback, advancing the understanding of feedback structure in contextual bandits and offering practical implications for auction bidding and related applications where graphical feedback is present.
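The combination of importance weighting and FTRL mentioned above can be illustrated with a minimal sketch for the known-distribution case. This is not the authors' algorithm: the function names, the array layout, and the choice of a negative-entropy regularizer (which turns FTRL into exponential weights) are assumptions made for illustration. The estimator divides each observed loss by the probability that the arm is observed, averaged over the context distribution, so the normalizer does not depend on the realized context.

```python
import numpy as np

def cross_learning_graph_estimator(loss, policy, nu, adj, played_arm):
    """Importance-weighted loss estimates for every (context, arm) pair.

    loss:       (C, K) array, loss[c, a] in the current round
    policy:     (C, K) array, policy[c] is the arm distribution under context c
    nu:         (C,) array, context distribution (assumed known in this sketch)
    adj:        (K, K) array, adj[b, a] is truthy if playing b reveals arm a
    played_arm: index of the arm that was actually played
    """
    adj = np.asarray(adj)
    # Probability of playing each arm, averaged over the random context.
    marginal_play = nu @ policy                       # shape (K,)
    # Probability that each arm's loss is revealed through the graph.
    observe_prob = marginal_play @ adj.astype(float)  # shape (K,)
    observed = adj[played_arm].astype(bool)           # arms revealed this round
    est = np.zeros_like(loss, dtype=float)
    # Cross-learning: the revealed losses are available for all contexts.
    est[:, observed] = loss[:, observed] / observe_prob[observed]
    return est

def ftrl_negative_entropy(cum_est, eta):
    """FTRL with a negative-entropy regularizer (an assumed choice):
    one exponential-weights distribution per context."""
    z = -eta * cum_est
    z -= z.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)
```

When every arm has a self-loop and every policy has full support, averaging the estimator over the random context and the played arm recovers the true loss matrix, which is the property an importance-weighted analysis relies on.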

Abstract

The cross-learning contextual bandit problem with graphical feedback has recently attracted significant attention. In this setting, there is a contextual bandit with a feedback graph over the arms, and pulling an arm reveals the loss for all neighboring arms in the feedback graph across all contexts. Initially proposed by Han et al. (2024), this problem has broad applications in areas such as bidding in first-price auctions, and explores a novel frontier in the feedback structure of bandit problems. A key theoretical question is whether an algorithm with $\widetilde{O}(\sqrt{\alpha T})$ regret exists, where $\alpha$ represents the independence number of the feedback graph. This question is particularly interesting because it concerns whether an algorithm can achieve a regret bound entirely independent of the number of contexts and matching the minimax regret of vanilla graphical bandits. Previous work has demonstrated that such an algorithm is impossible for adversarial contexts, but the question remains open for stochastic contexts. In this work, we affirmatively answer this open question by presenting an algorithm that achieves the minimax $\widetilde{O}(\sqrt{\alpha T})$ regret for cross-learning contextual bandits with graphical feedback and stochastic contexts. Notably, although that question is open even for stochastic bandits, we directly solve the strictly stronger adversarial bandit version of the problem.
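To make the role of the independence number $\alpha$ concrete, the following brute-force sketch computes it for small graphs. The helper name and the edge-list representation are hypothetical, and the graph is treated as undirected with self-loops ignored, which is a simplifying assumption. A complete graph (full-information feedback) has $\alpha = 1$, while a graph with no edges between distinct arms (pure bandit feedback) has $\alpha = K$.

```python
from itertools import combinations

def independence_number(num_arms, edges):
    """Size of the largest set of arms with no edge between any two of them.

    Exhaustive search, so only suitable for small graphs.
    edges: iterable of (u, v) pairs, treated as undirected; self-loops ignored.
    """
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    # Try the largest candidate sets first and return on the first success.
    for size in range(num_arms, 0, -1):
        for subset in combinations(range(num_arms), size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return size
    return 0

complete = [(u, v) for u, v in combinations(range(4), 2)]
path = [(0, 1), (1, 2), (2, 3)]
print(independence_number(4, complete))  # full-information feedback
print(independence_number(4, []))        # bandit feedback
print(independence_number(4, path))      # a path on four arms
```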

Paper Structure

This paper contains 19 sections, 10 theorems, 69 equations, 1 figure, 2 algorithms.

Key Result

Theorem 1

For $\iota = 2\log(8KT^2)$, $L = \sqrt{\frac{\iota \alpha T}{\log(K)}} = \widetilde{\Theta}(\sqrt{\alpha T})$, $\gamma = \frac{16\iota}{L} = \widetilde{\Theta}(1/\sqrt{\alpha T})$, and $\eta = \frac{\gamma}{2(2L\gamma + \iota)} = \widetilde{\Theta}(1/\sqrt{\alpha T})$, the unknown-distribution algorithm (alg:unknown) yields a regret bound of $\widetilde{O}(\sqrt{\alpha T})$.
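The parameter choices in Theorem 1 can be evaluated numerically with a short sketch. The function name is hypothetical, and the logarithm is assumed to be natural, since the statement does not specify a base. Substituting $\gamma = 16\iota/L$ gives $2L\gamma = 32\iota$, so the learning rate simplifies to $\eta = \frac{8}{33L}$, which the example checks.

```python
import math

def theorem1_parameters(K, T, alpha):
    """Evaluate iota, L, gamma, eta from Theorem 1 (natural log assumed).

    K: number of arms (K >= 2 so that log(K) > 0)
    T: time horizon
    alpha: independence number of the feedback graph
    """
    iota = 2 * math.log(8 * K * T**2)
    L = math.sqrt(iota * alpha * T / math.log(K))
    gamma = 16 * iota / L
    eta = gamma / (2 * (2 * L * gamma + iota))
    return iota, L, gamma, eta

iota, L, gamma, eta = theorem1_parameters(K=10, T=10_000, alpha=3)
# eta agrees with the simplified closed form 8 / (33 L).
print(math.isclose(eta, 8 / (33 * L)))
```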

Figures (1)

  • Figure 1: A figure in Sch23. Here we use it to illustrate the timeline of alg:unknown. At the end of epoch $\mathcal{T}_e$, the snapshot $s_{e+2}$ is fixed. The contexts within epoch $\mathcal{T}_{e}$ are used to compute loss estimators for epoch $\mathcal{T}_{e+1}$, which are fed to the FTRL sub-algorithm.

Theorems & Definitions (18)

  • Theorem 1
  • Lemma 2: Freedman's Inequality
  • Lemma 3: Lemma 5, GraphAlon15
  • Lemma 4: Lemma 11, Sch23
  • Definition 5
  • Lemma 6: Extension of Lemma 6 in Sch23
  • proof
  • Lemma 7: Extension of Lemma 7 in Sch23
  • proof
  • Definition 8
  • ...and 8 more