Smooth Non-Stationary Bandits
Su Jia, Qian Xie, Nathan Kallus, Peter I. Frazier
TL;DR
We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-Hölder function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable, and show the first separation between the smooth and non-smooth regimes.
Abstract
In many applications of online decision making, the environment is non-stationary, and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. In practice, however, environments often change smoothly, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $β$-Hölder function, i.e., a function that is $(β-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes smoother as $β$ increases. The case $β=1$ corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde Θ(T^{2/3})$. We show the first separation between the smooth (i.e., $β\ge 2$) and non-smooth (i.e., $β=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-Hölder instance. We complement this result by showing that the minimax regret on the $β$-Hölder family of instances is $Ω(T^{(β+1)/(2β+1)})$ for any integer $β\ge 1$. This lower bound matches our upper bound for $β=2$ up to logarithmic factors. Finally, we validate the effectiveness of our policy through a numerical study using real-world click-through rate data.
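
To make the stated rates concrete, the following minimal Python sketch evaluates the lower-bound exponent $(β+1)/(2β+1)$ for a few integer values of $β$. The helper name `lower_bound_exponent` is hypothetical and introduced here only for illustration; it restates the exponent arithmetic and ignores constants, the dependence on $k$, and logarithmic factors.

```python
from fractions import Fraction


def lower_bound_exponent(beta: int) -> Fraction:
    """Exponent of T in the Omega(T^{(beta+1)/(2*beta+1)}) lower bound.

    Hypothetical helper: only the exponent is computed, exactly, as a fraction.
    """
    if beta < 1:
        raise ValueError("beta must be an integer >= 1")
    return Fraction(beta + 1, 2 * beta + 1)


# Compare the non-smooth case (beta = 1) with smoother regimes.
for beta in (1, 2, 3, 10):
    exponent = lower_bound_exponent(beta)
    print(f"beta={beta}: T^({exponent}) ~ T^{float(exponent):.4f}")
```

Under these assumptions, $β=1$ gives exponent $2/3$, the rate of the non-smooth regime, and $β=2$ gives $3/5$, the exponent of $T$ in the upper bound above. The exponent decreases toward $1/2$ as $β$ grows, since $(β+1)/(2β+1) = 1/2 + 1/(2(2β+1))$.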
