Adaptive Learning Rate for Follow-the-Regularized-Leader: Competitive Analysis and Best-of-Both-Worlds

Shinji Ito; Taira Tsuchiya; Junya Honda

TL;DR

This work introduces a competitive-analysis framework to adapt the Follow-The-Regularized-Leader learning rate online, formalizing the rate choice with F(β_{1:T}; z_{1:T}, h_{1:T}) and its offline optimum F^*. It proves a fundamental lower bound on the competitive ratio and provides a stability-penalty matching (SPM) update that achieves a matching upper bound up to a constant, with explicit, implementable update rules in terms of η_t. The authors show that the optimal CR is governed by approximate monotonicity of the penalty sequence h_{1:T}, enabling constant-factor CR for ξ-approximately non-increasing sequences and yielding tight performance guarantees. Leveraging these ideas, they construct Best-of-Both-Worlds bandit algorithms based on Tsallis-entropy regularizers that attain O(log T) regret in stochastic settings and O(√T) in adversarial settings, across multi-armed, graph, linear, and contextual bandits, with problem-dependent constants captured by quantities like ω(Δ) and independence numbers. This framework unifies adaptive learning-rate design with competitive analysis to deliver near-optimal, environment-agnostic regret bounds that scale favorably with problem structure.

Abstract

Follow-The-Regularized-Leader (FTRL) is known as an effective and versatile approach in online learning, where an appropriate choice of the learning rate is crucial for achieving small regret. To this end, we formulate the problem of adjusting FTRL's learning rate as a sequential decision-making problem and introduce the framework of competitive analysis. We establish a lower bound for the competitive ratio and propose update rules for the learning rate that achieve an upper bound within a constant factor of this lower bound. Specifically, we show that the optimal competitive ratio is characterized by the (approximate) monotonicity of the components of the penalty term: a constant competitive ratio is achievable if the components of the penalty term form a monotonically non-increasing sequence, and we derive a tight competitive ratio when the penalty terms are $ξ$-approximately monotone non-increasing. Our proposed update rule, referred to as \textit{stability-penalty matching}, also facilitates constructing Best-Of-Both-Worlds (BOBW) algorithms for stochastic and adversarial environments. In these environments, our results contribute to achieving tighter regret bounds and broadening the applicability of algorithms to various settings such as multi-armed bandits, graph bandits, linear bandits, and contextual bandits.
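To make the idea of matching stability against penalty concrete, the following is a minimal Python sketch, not the authors' implementation. This page does not state the objective or the update rule, so the sketch rests on assumptions: it takes β_t = 1/η_t, assumes the objective has the form F = Σ_t ( z_t/β_t + (β_t − β_{t−1}) h_t ) with β_0 = 0, and assumes an update of the form β_{t+1} = β_t + z_t/(β_t h_{t+1}), which equates the stability term z_t/β_t with the next penalty increment. The function names and the toy inputs are hypothetical.

```python
import math


def ftrl_rate_objective(beta, z, h):
    """Assumed objective: sum_t z_t / beta_t + (beta_t - beta_{t-1}) * h_t, with beta_0 = 0."""
    total, prev = 0.0, 0.0
    for b, zt, ht in zip(beta, z, h):
        total += zt / b + (b - prev) * ht
        prev = b
    return total


def spm_betas(z, h, beta1):
    """Assumed stability-penalty matching rule:
    choose beta_{t+1} so that (beta_{t+1} - beta_t) * h_{t+1} equals z_t / beta_t."""
    betas = [beta1]
    for t in range(len(z) - 1):
        b = betas[-1]
        betas.append(b + z[t] / (b * h[t + 1]))
    return betas


# Toy instance (hypothetical): constant stability and penalty components.
T = 1000
z = [1.0] * T
h = [1.0] * T

betas = spm_betas(z, h, beta1=1.0)
online = ftrl_rate_objective(betas, z, h)

# With constant h and non-decreasing beta, the assumed objective is minimized by a
# constant beta = sqrt(sum(z) / h), which gives the offline value 2 * sqrt(h * sum(z)).
offline = 2.0 * math.sqrt(h[0] * sum(z))
ratio = online / offline
print(f"online={online:.2f} offline={offline:.2f} ratio={ratio:.3f}")
```

On this toy instance the ratio of the online value to the offline value comes out at about 1.4, which illustrates the constant-factor behaviour described for non-increasing penalty sequences under the assumptions above.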

Paper Structure

This paper contains 38 sections, 24 theorems, 155 equations, 2 tables, and 1 algorithm.

Introduction
Main contribution
Application: best-of-both-worlds regret bounds
Problem Setup
Stability-Penalty Matching
Application: best-of-both-worlds bandit algorithm
Algorithmic framework for best-of-both-worlds
Multi-armed bandit
Linear bandit
Additional Related Work
Online Learning using Tsallis entropy
Adaptive Learning Rate
Comparison of SPM learning rate against SPA learning rate
Lower Bound on the Competitive Ratio
Omitted Proofs in Sections \ref{sec:setup} and \ref{sec:UB}
...and 23 more sections

Key Result

Theorem 1

For any $T \in \mathbb{N}$, any $\xi \ge 1$, and for any policy $\pi = \{ \pi_t: (z_{1:t}, h_{1:t}) \mapsto \eta_t \}$, there exist $z_{1:T} \in \mathbb{R}_{\ge 0}^T$ and $h_{1:T} \in H_{\xi}^T$ such that $\mathrm{CR}(\pi; z_{1:T}, h_{1:T}) \ge \frac{\sqrt{T-1}}{\sqrt{T}+\xi} \sqrt{\xi}$.
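As a quick numerical reading of the bound in Theorem 1, the sketch below evaluates the right-hand side $\frac{\sqrt{T-1}}{\sqrt{T}+\xi}\sqrt{\xi}$ for a few values of $T$ and $\xi$. The function name is hypothetical, and the code only evaluates the stated expression; it does not construct the adversarial sequences from the proof.

```python
import math


def cr_lower_bound(T: int, xi: float) -> float:
    """Evaluate sqrt(T - 1) / (sqrt(T) + xi) * sqrt(xi) from Theorem 1 (requires T >= 1, xi >= 1)."""
    if T < 1 or xi < 1:
        raise ValueError("Theorem 1 is stated for T in N and xi >= 1")
    return math.sqrt(T - 1) / (math.sqrt(T) + xi) * math.sqrt(xi)


# For fixed xi, the factor sqrt(T - 1) / (sqrt(T) + xi) tends to 1 as T grows,
# so the bound approaches sqrt(xi).
for T in (2, 100, 10**6, 10**12):
    print(T, cr_lower_bound(T, 1.0), cr_lower_bound(T, 4.0))
```

For fixed $\xi$ the bound approaches $\sqrt{\xi}$ as $T$ grows, which matches the summary's statement that the achievable competitive ratio is governed by how far the penalty sequence is from being non-increasing.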

Theorems & Definitions (48)

Theorem 1
Theorem 2
Remark 1
Definition 1
Lemma 1
Remark 2
Theorem 3
Corollary 1
Lemma 2
Lemma 3
...and 38 more
