Optimal cross-learning for contextual bandits with unknown context distributions

Jon Schneider; Julian Zimmert

TL;DR

At the core of the algorithm is a novel technique for coordinating the execution of a learning algorithm over multiple epochs in a way that removes correlations between the estimation of the unknown context distribution and the actions played by the algorithm.

Abstract

We consider the problem of designing contextual bandit algorithms in the "cross-learning" setting of Balseiro et al., where the learner observes the loss of the action they play in all possible contexts, not just in the context of the current round. We specifically consider the setting where losses are chosen adversarially and contexts are sampled i.i.d. from an unknown distribution. In this setting, we resolve an open problem of Balseiro et al. by providing an efficient algorithm with a nearly tight (up to logarithmic factors) regret bound of $\widetilde{O}(\sqrt{TK})$, independent of the number of contexts. As a consequence, we obtain the first nearly tight regret bounds for the problems of learning to bid in first-price auctions (under unknown value distributions) and sleeping bandits with a stochastic action set. At the core of our algorithm is a novel technique for coordinating the execution of a learning algorithm over multiple epochs in a way that removes correlations between the estimation of the unknown distribution and the actions played by the algorithm. This technique may be of independent interest for other learning problems involving estimation of an unknown context distribution.
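To make the cross-learning feedback model concrete, the sketch below simulates a single round and builds an importance-weighted loss estimate for every context at once. It is only an illustration under a simplifying assumption: the context distribution `nu` is treated as known, which is exactly what the paper's setting does not allow. The function name `cross_learning_estimates` and all parameter names are hypothetical and do not come from the paper.

```python
import numpy as np

def cross_learning_estimates(losses, policy, nu, rng):
    """Simulate one round of cross-learning feedback (known-distribution sketch).

    losses: (C, K) array, losses[c, a] is the loss of action a in context c
    policy: (C, K) array, policy[c] is a distribution over the K actions
    nu:     (C,) array, context distribution (ASSUMED KNOWN in this sketch)
    """
    C, K = losses.shape
    c = rng.choice(C, p=nu)          # context sampled i.i.d. from nu
    a = rng.choice(K, p=policy[c])   # action sampled from the policy for c
    # Marginal probability of playing action a, averaged over contexts.
    q = float(nu @ policy[:, a])
    est = np.zeros((C, K))
    # Cross-learning: the played action's loss is observed in every context,
    # so every row of column a receives an importance-weighted estimate.
    est[:, a] = losses[:, a] / q
    return c, a, est

# Usage: average the estimates over many rounds and compare with the true losses.
rng = np.random.default_rng(0)
C, K = 4, 3
losses = rng.random((C, K))
policy = 0.5 / K + 0.5 * rng.dirichlet(np.ones(K), size=C)  # rows sum to 1
nu = np.full(C, 1.0 / C)
mean_est = np.mean(
    [cross_learning_estimates(losses, policy, nu, rng)[2] for _ in range(20000)],
    axis=0,
)
```

Dividing by the marginal probability `q` rather than by `policy[c, a]` is what keeps the variance of the estimate independent of the number of contexts in this sketch. With an unknown distribution, `q` has to be estimated from observed contexts, which is where the correlations mentioned in the abstract arise.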

Paper Structure (24 sections, 15 theorems, 62 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Techniques
Applications
Preliminaries
Notation
Non-uniform action sets
Challenges to extending existing algorithms
Our techniques
Avoiding high-probability bounds.
Increasing the number of i.i.d. samples.
Main result and analysis
The algorithm
Computational efficiency.
Analysis overview
Applications
...and 9 more sections

Key Result

Lemma 1

Let $X_1,\dots, X_t$ be i.i.d. samples from a distribution $\nu$ over $[0,1]$ with mean $\mu$, and let $\widehat{\mu}=\frac{1}{t}\sum_{s=1}^tX_s$ denote the empirical mean. Then

Figures (1)

Figure 1: Illustration of the timeline of the cross-learning FTRL algorithm. At the end of epoch $\mathcal{T}_e$, the snapshot $s_{e+2}$ is fixed. The contexts within epoch $\mathcal{T}_{e}$ are used to compute loss estimators for epoch $\mathcal{T}_{e+1}$, which are fed to the FTRL sub-algorithm.
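The data dependencies described in the Figure 1 caption can be sketched as a small bookkeeping routine. This is only a reading aid: the fixed epoch length, the 0-based epoch indexing, and the name `epoch_schedule` are assumptions made for illustration, and the handling of the first epochs (marked `None`) is not specified by the caption.

```python
def epoch_schedule(T, epoch_len):
    """Split rounds 0..T-1 into consecutive epochs and record, for each epoch,
    which earlier epoch supplies its contexts and when its snapshot is fixed."""
    schedule = []
    num_epochs = (T + epoch_len - 1) // epoch_len  # ceiling division
    for e in range(num_epochs):
        start = e * epoch_len
        end = min(T, start + epoch_len)
        schedule.append({
            "epoch": e,
            "rounds": range(start, end),
            # Contexts observed in epoch e - 1 feed the loss estimators of epoch e.
            "contexts_from_epoch": e - 1 if e >= 1 else None,
            # The snapshot s_e is fixed at the end of epoch e - 2.
            "snapshot_fixed_after_epoch": e - 2 if e >= 2 else None,
        })
    return schedule

# Usage: list the dependencies for a short horizon.
for row in epoch_schedule(T=10, epoch_len=4):
    print(row["epoch"], row["contexts_from_epoch"], row["snapshot_fixed_after_epoch"])
```

The point of the offset is that the quantities used in an epoch are determined from data collected strictly before that epoch, so the contexts used for estimation are not influenced by the actions played in it.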

Theorems & Definitions (29)

Lemma 1
Theorem 1
Corollary 1
proof
Corollary 2
Lemma 2 (banditbook, Lemma 12.2)
Lemma 3 (Bernstein-type inequality; banditbook, Exercise 5.15)
Lemma 4 (hazan2016introduction, Theorem 1.5)
Lemma 5
proof
...and 19 more
