Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback

Ofir Schlisselberg; Tal Lancewicki; Peter Auer; Yishay Mansour

TL;DR

The paper tackles multi-armed bandits with delayed feedback in the Best-of-Both-Worlds (BoBW) setting, where the goal is near-optimal regret in both stochastic and adversarial environments without knowing the regime in advance. It introduces a Delayed SAPO-based algorithm that starts in a stochastic-like mode and switches to an adversarial algorithm when needed, using BSC and EAP to detect the regime and to manage the sampling of eliminated arms. The main results are a near-optimal adversarial regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$ and a stochastic regret of $O\left(\sum_{i:\Delta_i>0}\frac{\log T}{\Delta_i} + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}\right)$. Further improvements include removing the $\Phi^*$ term and tightening the dependence on $K$ in the stochastic regime. The work substantially closes the gap to the lower bounds for delayed BoBW and offers an adaptive framework that may apply to broader online learning problems with delays.

Abstract

We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic setting. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound that scales as $\sum_{i:\Delta_i>0}\left(\log T/\Delta_i\right) + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}$, where $\Delta_i$ is the sub-optimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both the stochastic and adversarial regimes in delayed environments. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.
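To get a feel for how the two bounds in the abstract scale, the following minimal Python sketch evaluates their leading terms with all constants and logarithmic factors dropped. It is an illustration of the formulas only, not the paper's algorithm. The function names `adversarial_bound` and `stochastic_bound` and the example values for the gaps, $T$, and $\sigma_{\max}$ are hypothetical. The "prior" delay term is simply $K$ times the new one, which is an assumption read off from the abstract's "factor of $K$" statement.

```python
import math


def adversarial_bound(K, T, D):
    """Leading terms sqrt(K*T) + sqrt(D); constants and log factors are dropped."""
    return math.sqrt(K * T) + math.sqrt(D)


def stochastic_bound(gaps, T, sigma_max):
    """Return the two terms of the stochastic bound separately.

    gaps      : sub-optimality gaps Delta_i, one per arm (0 for an optimal arm)
    T         : number of rounds
    sigma_max : maximum number of missing observations
    """
    K = len(gaps)
    # First term: sum over sub-optimal arms of log(T) / Delta_i.
    gap_term = sum(math.log(T) / g for g in gaps if g > 0)
    # Second term: (1/K) * sum_i Delta_i * sigma_max.
    delay_term = sigma_max * sum(gaps) / K
    return gap_term, delay_term


# Hypothetical example values, chosen only for illustration.
gaps = [0.0, 0.1, 0.2, 0.5]
T = 10_000
sigma_max = 50

gap_term, delay_term = stochastic_bound(gaps, T, sigma_max)
# Assumption: the abstract says the second term improves by a factor of K,
# so the corresponding earlier term is taken to be K times larger.
prior_delay_term = len(gaps) * delay_term

print("adversarial:", adversarial_bound(len(gaps), T, D=10_000))
print("stochastic gap term:", gap_term)
print("stochastic delay term (new vs. prior):", delay_term, prior_delay_term)
```

Returning the two stochastic terms separately makes it easy to see which one dominates: the delay term matters only when $\sigma_{\max}$ is large relative to $\log T / \Delta_i$.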
