Smooth Non-Stationary Bandits

Su Jia, Qian Xie, Nathan Kallus, Peter I. Frazier

TL;DR

We study a non-stationary bandit problem in which each arm's mean reward sequence can be embedded into a $\beta$-Hölder function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable, and we show the first separation between the smooth and non-smooth regimes.

Abstract

In many applications of online decision making, the environment is non-stationary, so it is crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. In practice, however, environments often change smoothly, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandit problem where each arm's mean reward sequence can be embedded into a $\beta$-Hölder function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes smoother as $\beta$ increases. The case $\beta=1$ corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde\Theta(T^{2/3})$. We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-Hölder instance. We complement this result by showing that the minimax regret on the $\beta$-Hölder family of instances is $\Omega(T^{(\beta+1)/(2\beta+1)})$ for any integer $\beta\ge 1$. This matches our upper bound for $\beta=2$ up to logarithmic factors. Furthermore, we validate the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
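To make the rates in the abstract concrete, the short sketch below evaluates the exponent $(\beta+1)/(2\beta+1)$ of the lower bound for a few integer values of $\beta$. The function name `lower_bound_exponent` is a hypothetical helper introduced here for illustration; only the formula itself comes from the abstract.

```python
from fractions import Fraction


def lower_bound_exponent(beta: int) -> Fraction:
    """Exponent of T in the Omega(T^{(beta+1)/(2*beta+1)}) lower bound."""
    if beta < 1:
        raise ValueError("beta must be a positive integer")
    return Fraction(beta + 1, 2 * beta + 1)


# beta = 1 is the non-smooth regime; beta = 2 is the case matched by the upper bound.
for beta in (1, 2, 3, 4):
    e = lower_bound_exponent(beta)
    print(f"beta={beta}: exponent {e} = {float(e):.3f}")
```

The exponent equals $2/3$ at $\beta=1$ and $3/5$ at $\beta=2$, and it decreases toward $1/2$ as $\beta$ grows, which is the sense in which smoother instances admit lower regret.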

Paper Structure (37 sections, 31 theorems, 101 equations, 7 figures, 2 algorithms)

Key Result

Proposition 4.1

For any fixed integer $\beta\geq 1$, there exists a family $\{g_\varepsilon\}$ of $(\beta-1)$-times continuously differentiable functions where $g_\varepsilon$ is defined on $[0,\varepsilon]$, with (i) vanishing derivatives: $g^{(j)}_\varepsilon(0)=g^{(j)}_\varepsilon(\varepsilon)=0$ for any $j=1,\

Figures (7)

  • Figure 1: Illustration of $g_\varepsilon$ for $\beta = 4$: $g^{(3)}$ is a "flock" of pyramid-shaped functions. The function $g^{(2)}(x)$ is the integral of $g^{(3)}$ from $0$ to $x$; similarly, $g^{(1)}(x)$ is the integral of $g^{(2)}$ from $0$ to $x$. The key property is that every derivative of order lower than $3$ vanishes at the boundary points $0$ and $4w$.
  • Figure 2: Construction of the family $\mathcal{F}_\beta$, illustrated for $\beta=2$. The figure shows "snapshots" of the curves on the two consecutive epochs $[x_j, x_{j+1}]$ and $[x_{j+1}, x_{j+2}]$. For any combination of red and blue curves, the transition at every endpoint is smooth, since both the red and the blue curves have derivative $0$ at every $x_j$.
  • Figure 3: Log-log regret plot on synthetic data. To visualize how the regret $R$ of each policy (BE-NS, BE-S and Rexp3) scales with the length $T$ of the time horizon, we present a log-log (base 10) plot. Each data point represents the regret of a policy, averaged across $100$ randomly generated sinusoidal instances. We applied linear regression to the data points corresponding to each policy and obtained three fitted lines, whose expressions are provided in the figure. The slopes of these lines align closely with the theoretical values. In particular, the regret of BE-S grows considerably more slowly than that of the benchmarks.
  • Figure 4: Modeling non-stationarity in the CTR using Yahoo! data. We first employ a rolling window average method on the Yahoo! user-click data to obtain a non-smooth function that represents the variations of CTR in time, as illustrated in the left subfigure. In the second part of our experiment, we smooth these functions using local regression, resulting in a mean reward sequence of length $8.64\times 10^7$, where each round corresponds to a second; see the right subfigure.
  • Figure 5: Visualization of the experimental results in the counterfactual setting.
  • ...and 2 more figures
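
The slope estimation described in the Figure 3 caption can be sketched as an ordinary least-squares fit of $\log_{10} R$ against $\log_{10} T$. The snippet below is a minimal illustration under stated assumptions: `loglog_slope` is a hypothetical helper, and the regret values are a noise-free synthetic curve $R = 2\,T^{3/5}$ chosen for illustration, not data from the paper.

```python
import math


def loglog_slope(horizons, regrets):
    """Least-squares slope and intercept of log10(regret) against log10(horizon)."""
    xs = [math.log10(t) for t in horizons]
    ys = [math.log10(r) for r in regrets]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical noise-free regret curve R = 2 * T^(3/5); NOT data from the paper.
horizons = [10 ** p for p in range(3, 8)]
regrets = [2.0 * t ** 0.6 for t in horizons]
slope, intercept = loglog_slope(horizons, regrets)
print(f"estimated exponent: {slope:.3f}, log10 of constant: {intercept:.3f}")
```

On a curve of the form $R = c\,T^{a}$, the fitted slope recovers the exponent $a$ and the intercept recovers $\log_{10} c$, so the slope can be compared directly with a theoretical rate such as $3/5$.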

Theorems & Definitions (48)

  • Definition 2.1: Hölder Class
  • Definition 2.2: Smooth Non-stationary Instance
  • Definition 2.3: The Hölder Family
  • Definition 2.4: Regret
  • Proposition 4.1: Side of the Bowl
  • Definition 4.2: Construction of a Bowl
  • Definition 4.3: The Family $\mathcal{F}_\beta$
  • Theorem 4.4: Main Lower bound
  • Lemma 4.5: Likely to Select a Wrong Arm
  • Proposition 5.1: Generic Upper Bound, $\beta=1,k=1$
  • ...and 38 more
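
The "side of the bowl" idea from Proposition 4.1 and the Figure 1 caption can be explored numerically for $\beta=4$: start from a signed flock of pyramid-shaped functions as $g^{(3)}$ on $[0,4w]$ and integrate three times. This is only a sketch under assumptions not stated in the text above: the sign pattern $(+,-,-,+)$, the pyramid height `h`, and the grid size are chosen here for illustration and may differ from the authors' construction.

```python
def tent(u, w, h):
    """Pyramid of width w and height h supported on [0, w]."""
    return h * max(0.0, 1.0 - abs(2.0 * u / w - 1.0))


def third_derivative(x, w, h, signs=(1, -1, -1, 1)):
    """Signed flock of tents on [0, 4w]; the sign pattern is an assumption."""
    i = min(int(x // w), len(signs) - 1)
    return signs[i] * tent(x - i * w, w, h)


def cumulative_trapezoid(ys, dx):
    """Running trapezoidal integral of the samples ys, starting at 0."""
    out = [0.0]
    for a, b in zip(ys, ys[1:]):
        out.append(out[-1] + 0.5 * (a + b) * dx)
    return out


w, h, n = 1.0, 1.0, 4000          # illustrative width, height, and grid size
dx = 4 * w / n
xs = [j * dx for j in range(n + 1)]
g3 = [third_derivative(x, w, h) for x in xs]
g2 = cumulative_trapezoid(g3, dx)  # g''(x)
g1 = cumulative_trapezoid(g2, dx)  # g'(x)
g0 = cumulative_trapezoid(g1, dx)  # g(x)

for name, g in (("g'''", g3), ("g''", g2), ("g'", g1), ("g", g0)):
    print(f"{name}: left end {g[0]:+.6f}, right end {g[-1]:+.6f}")
```

With this sign pattern, the first three derivatives vanish (up to numerical error) at both $0$ and $4w$, while $g$ itself rises monotonically from $0$ to approximately $w^3 h$, which is the kind of smooth transition between two levels that the caption describes.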