Adaptive Regret for Bandits Made Possible: Two Queries Suffice

Zhou Lu; Qiuyi Zhang; Xinyi Chen; Fred Zhang; David Woodruff; Elad Hazan

TL;DR

This work studies strongly adaptive regret in adversarial online learning under a limited query model. It introduces StABL, a two-query bandit learner that combines an EXP3-based meta-algorithm with EXP3 base learners and a carefully designed observation distribution to produce unbiased loss estimators with controlled variance, achieving a near-optimal adaptive regret of $\tilde{O}(\sqrt{n|I|})$ for multi-armed bandits. The authors extend the approach to bandit convex optimization, showing that three queries suffice to attain $\tilde{O}(\sqrt{|I|})$ adaptive regret, and discuss potential two-query improvements via linear surrogate losses. Empirical results on volatile environments and downstream tasks like hyperparameter optimization demonstrate the practical advantages of the proposed methods for rapid adaptation with limited observations. The work sharpens the understanding of query efficiency in adaptive regret and offers a concrete algorithmic framework with strong theoretical guarantees and empirical validation.
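To make the phrase "unbiased loss estimators with controlled variance" concrete, here is a minimal Python sketch of standard importance weighting. One arm is observed according to an observation distribution `q`, and its loss is divided by the probability of observing it. The function names and the example `q` are illustrative assumptions. StABL's actual observation distribution is designed in the paper and is not reproduced here.

```python
import random

def importance_weighted_estimate(losses, q, rng):
    """Observe one arm sampled from q and build a loss estimate vector.

    The estimate is losses[j] / q[j] at the observed arm j and 0 elsewhere.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    n = len(losses)
    observed = rng.choices(range(n), weights=q, k=1)[0]
    estimate = [0.0] * n
    estimate[observed] = losses[observed] / q[observed]
    return observed, estimate

def expected_estimate(losses, q):
    """Exact expectation of the estimator, enumerating the observed arm."""
    n = len(losses)
    expectation = [0.0] * n
    for j in range(n):  # arm j is observed with probability q[j]
        expectation[j] += q[j] * (losses[j] / q[j])
    return expectation

def second_moment(losses, q):
    """E[estimate_i^2] = losses[i]^2 / q[i]; it blows up as q[i] shrinks."""
    return [losses[i] ** 2 / q[i] for i in range(len(losses))]

losses = [0.2, 0.9, 0.5]      # assumed example losses in [0, 1]
q = [0.5, 0.25, 0.25]         # assumed observation distribution
print(expected_estimate(losses, q))
print(second_moment(losses, q))
```

The expectation recovers the true loss vector for any `q` with full support, while the second moment scales as `1 / q[i]`. This is why the choice of observation distribution governs the variance of the estimator.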

Abstract

Fast-changing states or volatile environments pose a significant challenge to online optimization, which needs to perform rapid adaptation under limited observation. In this paper, we give query- and regret-optimal bandit algorithms under the strict notion of strongly adaptive regret, which measures the maximum regret over any contiguous interval $I$. Due to its worst-case nature, there is an almost-linear $\Omega(|I|^{1-\epsilon})$ regret lower bound when only one query per round is allowed [Daniely et al., ICML 2015]. Surprisingly, with just two queries per round, we give the Strongly Adaptive Bandit Learner (StABL), which achieves $\tilde{O}(\sqrt{n|I|})$ adaptive regret for multi-armed bandits with $n$ arms. The bound is tight and cannot be improved in general. Our algorithm leverages a multiplicative update scheme with varying step sizes and a carefully chosen observation distribution to control the variance. Furthermore, we extend our results and provide optimal algorithms in the bandit convex optimization setting. Finally, we empirically demonstrate the superior performance of our algorithms under volatile environments and for downstream tasks, such as algorithm selection for hyperparameter optimization.
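The difference between regret over the full horizon and regret over a contiguous interval can be seen in a small Python sketch. It computes the realized regret of a fixed action sequence against the best fixed arm on each interval of a given length. The helper names and the toy loss matrix are assumptions for illustration. The paper's guarantee concerns expected regret of a randomized learner, which this deterministic example does not model.

```python
def interval_regret(loss_matrix, actions, start, end):
    """Regret on rounds start..end-1 against the best fixed arm on that interval."""
    n = len(loss_matrix[0])
    learner_loss = sum(loss_matrix[t][actions[t]] for t in range(start, end))
    best_fixed_loss = min(
        sum(loss_matrix[t][arm] for t in range(start, end)) for arm in range(n)
    )
    return learner_loss - best_fixed_loss

def max_regret_over_intervals(loss_matrix, actions, length):
    """Largest interval regret over all contiguous intervals of the given length."""
    horizon = len(loss_matrix)
    return max(
        interval_regret(loss_matrix, actions, s, s + length)
        for s in range(horizon - length + 1)
    )

# Toy volatile environment (assumed): arm 0 is best early, arm 1 is best late.
loss_matrix = [[0, 1], [0, 1], [1, 0], [1, 0]]
actions = [0, 0, 0, 0]  # a learner that never adapts

print(interval_regret(loss_matrix, actions, 0, 4))        # full-horizon regret
print(max_regret_over_intervals(loss_matrix, actions, 2))  # worst length-2 interval
```

In this toy instance the non-adapting learner has zero regret over the whole horizon, because no single arm beats it overall, yet it suffers regret 2 on the last two rounds. Adaptive regret is designed to penalize exactly this failure to track a changing best arm.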

Paper Structure (27 sections, 4 theorems, 41 equations, 3 figures, 1 table, 4 algorithms)

Introduction
Our Results
Related Work
Adaptive Regret Minimization
Dynamic Regret
Bandit Convex Optimization
Settings and Preliminaries
The Query Model
The EXP3 Algorithm
Standard Framework for Minimizing Adaptive Regret
Adaptive Regret in Multi-Armed Bandits
Proof Sketch
Adaptive Regret in the BCO Setting
Experiments
Learning from Expert Advice
...and 12 more sections

Key Result

Theorem 1

For the multi-armed bandit problem with $n$ arms and $T$ rounds, the StABL algorithm achieves an expected adaptive regret bound of $O\left(\sqrt{nI\log n}\,\log^{1.5} T\right)$ on any contiguous interval of length $I$, using two queries per round.
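As a rough illustration of how the bound in Theorem 1 scales, the following Python sketch evaluates the expression inside the $O(\cdot)$. The hidden constant is unknown and is set to 1 here purely as an assumption, and natural logarithms are assumed, so the absolute values carry no meaning. Only the scaling in the interval length is informative.

```python
import math

def adaptive_regret_bound(n, interval_len, horizon, constant=1.0):
    """Evaluate constant * sqrt(n * I * log n) * log(T)^1.5.

    `constant` stands in for the unspecified constant in the O(.) notation
    (assumed to be 1 here). Requires n >= 2 so that log n is positive.
    """
    return (
        constant
        * math.sqrt(n * interval_len * math.log(n))
        * math.log(horizon) ** 1.5
    )

short = adaptive_regret_bound(n=10, interval_len=100, horizon=10**6)
long = adaptive_regret_bound(n=10, interval_len=400, horizon=10**6)
print(long / short)  # quadrupling the interval length doubles the bound
```

The ratio reflects the square-root dependence on the interval length, in contrast to the almost-linear $\Omega(|I|^{1-\epsilon})$ lower bound for a single query per round.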

Figures (3)

Figure 1: Comparison plots of the algorithm rewards in the learning with expert advice setting. The right subfigure shows the performance of the algorithms when the best arm changes at random intervals, and demonstrates the advantage of using base algorithms with varying history lengths.
Figure 2: Algorithm comparison plots of the log objective (lower is better) and the performance profile score against the Uniform baseline (higher is better) for minimizing the 32-dimensional SPHERE across 1000 trials.
Figure 3: Further comparison plots of the algorithm rewards in the learning with expert advice setting.

Theorems & Definitions (7)

Theorem 1: Adaptive regret minimization for multi-armed bandits
Lemma 2: Regret for EXP3
Theorem 3
proof
Lemma 4
proof
proof
