Table of Contents
Fetching ...

Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching

Quan Nguyen, Shinji Ito, Junpei Komiyama, Nishant A. Mehta

TL;DR

Real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses.

Abstract

Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandits problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent or have sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and $T$-optimal for multi-armed bandits problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.

Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching

TL;DR

Real-time SPM obtains bounds with worst-case guarantees of order in the adversarial regime and in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses.

Abstract

Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandits problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent or have sub-optimal worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and -optimal for multi-armed bandits problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order in the adversarial regime and in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.

Paper Structure

This paper contains 31 sections, 35 theorems, 215 equations, 1 table, 4 algorithms.

Key Result

Lemma 2

For any $T \geq 1, z_{1:T} \geq 0, h_{1:T} > 0$ and a sequence $\beta_{1:T}$ defined by Equation eq:SPM, let $F(z_{1:T}, h_{1:T}) = \sum_{t=1}^T \frac{z_t}{\beta_t}$ and $G(z_{1:T}, h_{1:T}) = \sum_{t=1}^T \frac{z_t}{\sqrt{\sum_{s=1}^t \frac{z_s}{h_s}}}$. We have

Theorems & Definitions (39)

  • Definition 1
  • Lemma 2
  • Theorem 3
  • Remark 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Remark 8
  • Theorem 9
  • Remark 10
  • ...and 29 more