Non-Stationary Lipschitz Bandits

Nicolas Nguyen, Solenne Gaucher, Claire Vernade

TL;DR

This work studies non-stationary Lipschitz bandits over a continuous action space, where the reward function $\mu_t$ can change arbitrarily over time while remaining Lipschitz in the action. The authors introduce MDBE, a multi-depth bin elimination algorithm that discretizes the action space hierarchically, runs replays at multiple scales, and eliminates suboptimal regions, which lets it track significant shifts without prior knowledge of the non-stationarity. They prove a minimax-optimal dynamic regret bound $\mathbb{E}[R(\pi_{\mathrm{MDBE}},T)] = \widetilde{O}(\tilde{L}^{1/3} T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ is the horizon, together with a matching lower bound, and they extend the analysis to Hölder and multi-dimensional settings. The paper also discusses future work on scalability and practical deployment. Overall, it delivers the first optimal guarantee for non-stationary Lipschitz bandits and introduces a scale-aware adaptation mechanism for continuous-action exploration.

Abstract

We study the problem of non-stationary Lipschitz bandits, where the number of actions is infinite and the reward function, satisfying a Lipschitz assumption, can change arbitrarily over time. We design an algorithm that adaptively tracks the recently introduced notion of significant shifts, defined by large deviations of the cumulative reward function. To detect such reward changes, our algorithm leverages a hierarchical discretization of the action space. Without requiring any prior knowledge of the non-stationarity, our algorithm achieves a minimax-optimal dynamic regret bound of $\widetilde{\mathcal{O}}(\tilde{L}^{1/3}T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ the horizon. This result provides the first optimal guarantee in this setting.
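
The shape of the rate $\tilde{L}^{1/3}T^{2/3}$ can be motivated by a standard back-of-the-envelope calculation. The sketch below is a heuristic only, not the paper's proof. It assumes a uniform discretization into $K$ bins (the symbols $K$, $n$, and $n_i$ are introduced here for illustration), a $1$-Lipschitz reward, the usual $\sqrt{Kn}$ regret of a $K$-armed bandit over $n$ rounds, and it ignores constants and logarithmic factors.

```latex
% Stationary case: K bins of width 1/K over n rounds.
%   learning cost        ~ \sqrt{K n}
%   discretization cost  ~ n / K        (1-Lipschitz rewards)
\[
  \sqrt{Kn} + \frac{n}{K}
  \quad\text{is of order}\quad n^{2/3}
  \quad\text{when } K \asymp n^{1/3}.
\]
% Non-stationary case: on the order of \tilde{L} phases with lengths
% n_1 + \dots + n_{\tilde{L}} = T. Concavity of x \mapsto x^{2/3} (Jensen) gives
\[
  \sum_{i=1}^{\tilde{L}} n_i^{2/3}
  \;\le\; \tilde{L}\left(\frac{T}{\tilde{L}}\right)^{2/3}
  \;=\; \tilde{L}^{1/3}\, T^{2/3}.
\]
```

For scale, with $T = 10^6$ and $\tilde{L} = 10$ (the values used in Figure 4), $\tilde{L}^{1/3}T^{2/3} \approx 2.15 \times 10^4$, again up to constants and logarithmic factors.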

Paper Structure

This paper contains 33 sections, 16 theorems, 184 equations, 4 figures, 2 tables, 3 algorithms.

Key Result

Proposition 1

If a bin $B \in \mathcal{T}_d$ incurs significant regret on an interval $[s_1, s_2]$ with $s_2 - s_1 \leq 8^d$, then every point $x \in B$ also incurs significant regret over $[s_1, s_2]$.

Figures (4)

  • Figure 1: Example of sampling with $m=4$; active bins are in blue. Left: At time $t_1$, depths $3$ and $m$ are active. A sample path may select bin $B_{3,3}$ uniformly at random (u.a.r.) at depth $3$, then $B_{m,6}$ u.a.r. among its active children, then arm $x_t$ u.a.r. in $B_{m,6}$ (red path). Center: At $t_1 + 1$, a replay starts at depth $1$. A path may go through $B_{1,1} \rightarrow B_{3,2} \rightarrow B_{m,4}$, selecting $x_t$ in $B_{m,4}$ (red path). Alternatively, $B_{1,2}$ could be chosen; with no active children, $x_t$ is sampled directly from it (green choice). Right: At $t_1 + 9$, depth $1$ exits replay. Bin $B_{3,2}$ has been eliminated during the replay, and a path may select $B_{3,1} \rightarrow B_{m,1}$, then $x_t$ in $B_{m,1}$ (red path).
  • Figure 2: For a given depth $d$, partition of the rounds of a block where $d_0(t)=d$. Each of these blue intervals must be initiated by the start of a replay at this depth, i.e., $R_{s,d}=1$.
  • Figure 3: Reward functions $m^1$ (left) and $m^3$ (right) when $K = 5$.
  • Figure 4: Cumulative dynamic regret of MDBE, BinningUCB (naive), and BinningUCB (oracle) over a total horizon of $T = 10^6$ rounds with $10$ significant shifts. Results are averaged over $100$ independent runs, with $95\%$ confidence intervals of the mean dynamic regret shown.
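
The sampling walk described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the action space is $[0,1]$ and that bin $B_{d,i}$ is the dyadic interval $[(i-1)/2^d,\, i/2^d]$ (an indexing that is consistent with the paths in the caption, e.g. $B_{1,1} \rightarrow B_{3,2} \rightarrow B_{m,4}$ with $m=4$). The function names and the `active` dictionary are hypothetical.

```python
import random
from fractions import Fraction


def bin_interval(depth, index):
    """Endpoints of B_{depth,index} (1-indexed), assuming a dyadic split of [0, 1]."""
    width = Fraction(1, 2 ** depth)
    return ((index - 1) * width, index * width)


def descendants(depth, index, target_depth):
    """Indices at target_depth of the bins nested inside B_{depth,index}."""
    factor = 2 ** (target_depth - depth)
    return range((index - 1) * factor + 1, index * factor + 1)


def sample_arm(active, rng):
    """Walk the active depths from coarsest to finest.

    At each active depth, pick u.a.r. among the active bins nested in the
    current bin; stop early if there are none. Then draw the arm uniformly
    from the last bin reached.

    `active` (hypothetical structure) maps each active depth to the set of
    active bin indices at that depth.
    """
    depth, index = 0, 1  # start from the root bin, i.e. all of [0, 1]
    path = []
    for d in sorted(active):
        candidates = [i for i in descendants(depth, index, d) if i in active[d]]
        if not candidates:
            break  # no active children: sample directly from the current bin
        depth, index = d, rng.choice(candidates)
        path.append((depth, index))
    lo, hi = bin_interval(depth, index)
    x = float(lo) + rng.random() * float(hi - lo)
    return path, x
```

For instance, with `active = {1: {1, 2}, 3: {1, 2, 3}, 4: {1, 4, 6}}`, a call to `sample_arm(active, random.Random(0))` either stops at $B_{1,2}$ (which has no active children in this configuration) or descends through one bin per active depth, mirroring the center panel of Figure 1.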

Theorems & Definitions (41)

  • Definition 1: Significant regret of an arm $\bm{x}$
  • Definition 2: Significant shift, significant phase
  • Definition 3: Depth and Dyadic tree
  • Definition 4: Significant regret for a bin
  • Proposition 1: Significant regret of a bin implies significant regret of an action
  • Theorem 1: Lower bound on the dynamic regret
  • Theorem 2: Adaptive upper bound on dynamic regret
  • Remark 1: Beyond $1$-Lipschitz bandits
  • Corollary 1: Regret bound in terms of total variation $V_T$
  • Proposition 2: Concentration event
  • ...and 31 more