Adaptive Smooth Non-Stationary Bandits

Joe Suk

TL;DR

This work studies $K$-armed non-stationary bandits whose rewards evolve smoothly in time under $(\beta,\lambda)$-Hölder conditions, unifying switching and variation-based models and deriving the minimax dynamic regret rate for all $K,\beta,\lambda$. It shows that this rate can be achieved adaptively, without knowledge of $(\beta,\lambda)$, via the META algorithm, which detects significant shifts and restarts while maintaining near-optimal rates. The paper also introduces a gap-dependent analysis with a significant-shift oracle, revealing faster rates when a safe arm exists, and establishes a precise phase-transition threshold at $\max_n \lambda_n \le \sqrt{K/T}$ separating fast gap-dependent regret from the worst-case $\sqrt{KT}$ regime. Overall, it provides sharp lower and upper bounds that adapt to both smoothness and non-stationarity.

Abstract

We study a $K$-armed non-stationary bandit model where rewards change smoothly, as captured by Hölder class assumptions on rewards as functions of time. Such smooth changes are parametrized by a Hölder exponent $\beta$ and coefficient $\lambda$. While various sub-cases of this general model have been studied in isolation, we first establish the minimax dynamic regret rate generally for all $K,\beta,\lambda$. Next, we show this optimal dynamic regret can be attained adaptively, without knowledge of $\beta,\lambda$. In contrast, even with parameter knowledge, upper bounds were previously known only for the limited regimes $\beta\leq 1$ and $\beta=2$ (Slivkins, 2014; Krishnamurthy and Gopalan, 2021; Manegueu et al., 2021; Jia et al., 2023). Thus, our work resolves open questions raised by these disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in non-stationary bandits. While such rates have long been known to be impossible in general (Garivier and Moulines, 2011), we show that environments admitting a safe arm (Suk and Kpotufe, 2022) allow for much faster rates than the worst-case scaling with $\sqrt{T}$. While previous works in this direction focused on attaining the usual logarithmic regret bounds, as summed over stationary periods, our new gap-dependent rates reveal new optimistic regimes of non-stationarity where even the logarithmic bounds are pessimistic. We show our new gap-dependent rate is tight and that its achievability (i.e., as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth Hölder class model.
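To make the smoothness assumption concrete, the following is a minimal sketch of a check for the Hölder condition $|f(x)-f(y)| \le \lambda |x-y|^{\beta}$ on a sequence of mean rewards. It covers only the case $\beta \le 1$ (for larger $\beta$ the Hölder class constrains derivatives, which this sketch does not handle). The rescaling of round $t$ to $t/T \in [0,1]$, the helper name `holder_violations`, and the two example sequences are assumptions made for illustration, not taken from the paper.

```python
import itertools

def holder_violations(means, beta, lam):
    """Return index pairs (s, t) violating |f(s/T) - f(t/T)| <= lam * |s/T - t/T|**beta.

    Assumes beta <= 1 and that round t is mapped to the point t/T in [0, 1]
    (an illustrative convention, not necessarily the paper's).
    """
    T = len(means)
    bad = []
    for s, t in itertools.combinations(range(T), 2):
        dist = (t - s) / T
        # Small tolerance so floating-point rounding is not reported as a violation.
        if abs(means[t] - means[s]) > lam * dist ** beta + 1e-12:
            bad.append((s, t))
    return bad

T = 200
# Linear drift of total size 0.5 over the horizon: Lipschitz (beta = 1) with lam = 0.5.
drift = [0.25 + 0.5 * t / T for t in range(T)]
# A single abrupt jump of size 0.5 halfway through the horizon.
jump = [0.25 if t < T // 2 else 0.75 for t in range(T)]

drift_bad = holder_violations(drift, beta=1.0, lam=0.5)
jump_bad = holder_violations(jump, beta=1.0, lam=0.5)
```

Under these assumptions the drifting sequence satisfies the condition everywhere, while every pair of rounds straddling the jump violates it; an abrupt switch needs a much larger coefficient $\lambda$ to fit the same class.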

Paper Structure (18 sections, 8 theorems, 42 equations, 2 figures, 1 table)

Introduction
Further Discussion on Related Works
Contributions
Problem Setup
Preliminaries and Notation
Smooth Non-Stationary Bandits
Dynamic Regret Lower Bound
Dynamic Regret Upper Bound
More Details About META
Gap-Dependent Dynamic Regret Bounds
Related Work on Gap-Dependent Regret
Refined Regret Analysis of the Significant Shift Oracle
Properties of New Gap-Dependent Regret Rate (Proofs in Supplement)
Elimination Achieves Gap-Dependent Regret Rate in Safe Environments
Lower Bound for Gap-Dependent Regret Rate
...and 3 more sections

Key Result

Theorem 1

(Proof in the lower-bound appendix.) Fix $\beta, \lambda > 0$, $K \geq 2$, and $T \in \mathbb{N}$. For any algorithm $\pi$, there exists an environment $\mathcal{E} \in \Sigma(\beta,\lambda)$ such that the regret is lower bounded by

Figures (2)

Figure 1: An example of a non-stationary safe environment where no significant shift occurs because arm $3$ is safe throughout, maintaining small dynamic regret, even while being suboptimal at all times.
Figure 2: Shown are two replay durations $m = m_1 \text{ or } m_2$ occurring roughly every $\sqrt{M/m}$ rounds, following the random replay schedule of the META algorithm, where $M$ is the eventual length of an episode. Each replay (blue segment) aims to detect a $1/\sqrt{m}$ magnitude change, i.e., an average dynamic regret $\frac{1}{m}\sum_{t=1}^m \delta_t(a)$ of order $1/\sqrt{m}$. As a recursive procedure, the replays of Base-Alg form a parent-child relationship as depicted.
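The detection scale in the Figure 2 caption can be illustrated with a small sketch: a replay of length $m$ flags an arm when its average estimated gap over the replay exceeds a threshold of order $1/\sqrt{m}$. The function name `replay_flags_shift`, the constant `c`, and the assumption that per-round gap estimates are already available are all hypothetical; the paper's actual test (including any confidence-bound or logarithmic factors) is not reproduced here.

```python
import math

def replay_flags_shift(gap_estimates, c=1.0):
    """Flag a change when the average estimated gap over a replay of length m
    exceeds c / sqrt(m).

    `gap_estimates` stands in for per-round estimates of delta_t(a); `c` is a
    hypothetical constant chosen for illustration only.
    """
    m = len(gap_estimates)
    if m == 0:
        return False
    average_gap = sum(gap_estimates) / m
    return average_gap > c / math.sqrt(m)

# A short replay (m = 16) has threshold 0.25, so it misses an average gap of 0.2 ...
short_replay = replay_flags_shift([0.2] * 16)
# ... while a longer replay (m = 100) has threshold 0.1 and detects the same gap.
long_replay = replay_flags_shift([0.2] * 100)
```

This shows why replays of several durations are useful: longer replays can detect smaller changes, at the cost of more rounds spent re-exploring.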

Theorems & Definitions (18)

Definition 1: Hölder Class Function
Definition 2: Hölder Gap Environments
Theorem 1
Remark 1
Definition 3
Theorem 2 (proof in the upper-bound appendix)
Corollary 3
Remark 2
Definition 4
Remark 3
...and 8 more
