
A Simple and Adaptive Learning Rate for FTRL in Online Learning with Minimax Regret of $Θ(T^{2/3})$ and its Application to Best-of-Both-Worlds

Taira Tsuchiya, Shinji Ito

TL;DR

This paper addresses online learning problems whose minimax regret is $Θ(T^{2/3})$ by introducing SPB-matching, a simple learning-rate framework for FTRL that matches the stability, penalty, and bias terms of the regret bound. Combined with Tsallis entropy regularization, it yields Best-of-Both-Worlds algorithms with simultaneous stochastic and adversarial guarantees in three indirect-feedback settings: partial monitoring with global observability, graph bandits with weak observability, and multi-armed bandits with paid observations. The resulting bounds improve on prior FTRL-based results and show that a simple choice of Tsallis entropy exponent, $\alpha = 1 - 1/\log k$, gives near-optimal dependence on the problem dimensions and on time. The work gives a unified, MS-type bound approach across these hard problems and establishes the first MS-type BOBW bounds for graph bandits and for partial monitoring (PM) with global observability, with potential applicability to other adversarial-stochastic hybrids.

Abstract

Follow-the-Regularized-Leader (FTRL) is a powerful framework for various online learning problems. By designing its regularizer and learning rate to be adaptive to past observations, FTRL is known to work adaptively to various properties of an underlying environment. However, most existing adaptive learning rates are for online learning problems with a minimax regret of $Θ(\sqrt{T})$ for the number of rounds $T$, and there are only a few studies on adaptive learning rates for problems with a minimax regret of $Θ(T^{2/3})$, which include several important problems dealing with indirect feedback. To address this limitation, we establish a new adaptive learning rate framework for problems with a minimax regret of $Θ(T^{2/3})$. Our learning rate is designed by matching the stability, penalty, and bias terms that naturally appear in regret upper bounds for problems with a minimax regret of $Θ(T^{2/3})$. As applications of this framework, we consider three major problems with a minimax regret of $Θ(T^{2/3})$: partial monitoring, graph bandits, and multi-armed bandits with paid observations. We show that FTRL with our learning rate and the Tsallis entropy regularizer improves existing Best-of-Both-Worlds (BOBW) regret upper bounds, which achieve simultaneous optimality in the stochastic and adversarial regimes. The resulting learning rate is surprisingly simple compared to the existing learning rates for BOBW algorithms for problems with a minimax regret of $Θ(T^{2/3})$.

Paper Structure (39 sections, 16 theorems, 108 equations, 1 table, 2 algorithms)


Key Result

Theorem 1

There exist a learning rate $(\beta_t)_t$ and an exploration rate $(\gamma_t)_t$ for which the RHS of eq:bound_1_intro is bounded by $O\Big( \Big( \sum_{t=1}^T \sqrt{z_t h_t \log(\varepsilon T)} \Big)^{2/3} + \Big( \sqrt{z_{\max} h_{\max}} / \varepsilon \Big)^{2/3} \Big)$ for any $\varepsilon \geq 1/T$, whe
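
As a sanity check on the bound in Theorem 1, consider the simplest case in which the per-round quantities are constant. This is an illustrative assumption, not a setting taken from the paper: $z_t = z$ and $h_t = h$ for all $t$, with $z_{\max} = z$ and $h_{\max} = h$ (read here as the maxima over rounds, since the excerpt is cut off before defining them), and $\varepsilon = 1$, which satisfies $\varepsilon \geq 1/T$. The sum then collapses and the bound becomes

$$
\Big( \sum_{t=1}^T \sqrt{z h \log T} \Big)^{2/3} + \big( \sqrt{z h} \big)^{2/3}
= T^{2/3} (z h \log T)^{1/3} + (z h)^{1/3} .
$$

Under these assumptions the leading term grows as $T^{2/3}$ up to a $(\log T)^{1/3}$ factor, which is consistent with the $Θ(T^{2/3})$ minimax rate discussed in the abstract.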

Theorems & Definitions (25)

  • Theorem 1: informal version of \ref{thm:F_upper_final}
  • Theorem 2: informal version of \ref{thm:main_bobw}
  • Definition 3
  • Lemma 4
  • Lemma 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Corollary 9
  • Theorem 10
  • ...and 15 more