High Probability Bound for Cross-Learning Contextual Bandits with Unknown Context Distributions

Ruiyuan Huang; Zengfeng Huang

TL;DR

This work studies cross-learning contextual bandits with adversarial losses and unknown context distributions, aiming for high-probability regret guarantees. Building on the EXP3-type algorithm of Schneider and Zimmert (2023), it develops a deeper, epoch-aware analysis that exploits weak dependencies across epochs and introduces a surrogate loss sequence to enable tight martingale concentration. The authors prove a high-probability regret bound of $\widetilde{O}(\sqrt{KT\log(1/\delta)})$ for any policy, matching the order of the known expected bound under unknown context distribution. This advances theoretical guarantees for learning in settings like online bidding and sleeping bandits where context distributions are not known in advance and losses can be adversarial.

Abstract

Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss of the chosen action under all possible contexts, not just the current round's context. Our focus is on a setting where losses are chosen adversarially and contexts are sampled i.i.d. from a fixed distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider and Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret without knowing the context distribution. It is well known that bounds on expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. Some steps in the original analysis of Schneider and Zimmert (2023) inherently yield only a bound in expectation. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.
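To make the cross-learning feedback model concrete, the following is a minimal sketch of an EXP3-style learner for the simpler case where the context distribution is known. It is not the algorithm of Schneider and Zimmert (2023), which must handle an unknown distribution. The function names, the learning rate `eta`, and the importance-weighted estimator (dividing by the probability of the played action averaged over contexts) are illustrative assumptions, not taken from the paper.

```python
import math
import random


def exp_weights(cum_losses, eta):
    """Exponential-weights distribution over actions (numerically stable)."""
    m = min(cum_losses)
    w = [math.exp(-eta * (x - m)) for x in cum_losses]
    s = sum(w)
    return [x / s for x in w]


def cross_learning_exp3(losses, contexts, nu, eta, rng):
    """Illustrative EXP3 with cross-learning and a KNOWN context distribution.

    losses[t][c][a]: loss in [0, 1] of action a under context c at round t
    contexts[t]:     context observed at round t (assumed drawn from nu)
    nu[c]:           probability of context c (assumed known in this sketch)
    """
    T, C, K = len(losses), len(nu), len(losses[0][0])
    cum_est = [[0.0] * K for _ in range(C)]  # loss estimates per context
    total_loss = 0.0
    for t in range(T):
        # One action distribution per context.
        probs = [exp_weights(cum_est[c], eta) for c in range(C)]
        c_t = contexts[t]
        a_t = rng.choices(range(K), weights=probs[c_t])[0]
        total_loss += losses[t][c_t][a_t]
        # Probability of playing a_t, averaged over the context distribution.
        f = sum(nu[c] * probs[c][a_t] for c in range(C))
        # Cross-learning: the loss of a_t is observed for EVERY context,
        # so every context's estimate for a_t is updated.
        for c in range(C):
            cum_est[c][a_t] += losses[t][c][a_t] / f
    return total_loss, cum_est
```

Because the played action's loss is revealed under every context, a single round updates the estimates for all contexts at once. Dividing by `f` keeps each estimate unbiased, but it requires knowing `nu`, which is the assumption the paper removes.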

Paper Structure (23 sections, 8 theorems, 98 equations, 1 algorithm)

Introduction
Technical Overview
Related Works
Problem Statement
The Algorithm in Schneider and Zimmert (2023)
Intuition behind Schneider and Zimmert (2023)
A Formal Description of the Algorithm in Schneider and Zimmert (2023)
Main Result and Analysis
Regret Decomposition
Identifying a Prototypical Term
Analysis of the Prototypical Term
Conclusions
Useful Lemmas
Detailed Proof of Theorem 1
Decomposition
...and 8 more sections

Key Result

Theorem 1

For any $\delta \in (0,1)$, Algorithm 1 with parameters $\iota=2 \log (8 K T / \delta)$, $L=\sqrt{\frac{\iota K T}{\log (K)}}=\widetilde{\Theta}(\sqrt{K T \log\frac{1}{\delta}})$, $\gamma=\frac{16 \iota}{L}=\widetilde{\Theta}(\sqrt{\frac{\log(1/\delta)}{K T}})$, and $\eta=\frac{\gamma}{\cdots}$ achieves regret $\widetilde{O}(\sqrt{KT\log(1/\delta)})$ with probability at least $1 - \delta$ against any policy $\pi$.
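As a quick numerical check of the scalings in the theorem, the closed-form parameters $\iota$, $L$, and $\gamma$ can be computed directly. The sketch below is only an illustration: the function name `theorem1_parameters` is hypothetical, it assumes $K \ge 2$ so that $\log K > 0$, and it omits $\eta$ because the expression for $\eta$ is incomplete in the statement above.

```python
import math


def theorem1_parameters(K, T, delta):
    """Evaluate iota, L and gamma from the stated parameter choices.

    Assumes K >= 2 (so log K > 0), T >= 1 and 0 < delta < 1.
    """
    if K < 2 or T < 1 or not (0.0 < delta < 1.0):
        raise ValueError("need K >= 2, T >= 1 and 0 < delta < 1")
    iota = 2.0 * math.log(8.0 * K * T / delta)   # iota = 2 log(8KT / delta)
    L = math.sqrt(iota * K * T / math.log(K))    # L = sqrt(iota K T / log K)
    gamma = 16.0 * iota / L                      # gamma = 16 iota / L
    return iota, L, gamma
```

For example, `theorem1_parameters(10, 10**6, 0.01)` returns the three values for $K=10$ actions, $T=10^6$ rounds, and failure probability $\delta=0.01$. Increasing $T$ makes $L$ grow roughly like $\sqrt{T}$ and $\gamma$ shrink roughly like $1/\sqrt{T}$, up to logarithmic factors.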

Theorems & Definitions (13)

Theorem 1: Formal
Lemma 1: Freedman's Inequality
Definition 1
Lemma 2: Lemma 6 and Lemma 7 of Schneider and Zimmert (2023)
Lemma 3: Lemma 8 of Schneider and Zimmert (2023)
Definition 2
Lemma 4: Lemma 9 of Schneider and Zimmert (2023)
Definition 3
Lemma 5: Lemma 10 of Schneider and Zimmert (2023)
Lemma 6
...and 3 more
