Smooth Non-Stationary Bandits

Su Jia; Qian Xie; Nathan Kallus; Peter I. Frazier


TL;DR

We study a non-stationary bandit problem in which each arm's mean reward sequence can be embedded into a $\beta$-Hölder function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable, and show the first separation between the smooth and non-smooth regimes.

Abstract

In many applications of online decision making, the environment is non-stationary, and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change smoothly, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandit problem where each arm's mean reward sequence can be embedded into a $\beta$-Hölder function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes smoother as $\beta$ increases. The case $\beta=1$ corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde\Theta(T^{2/3})$. We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-Hölder instance. We complement this result by showing that the minimax regret on the $\beta$-Hölder family of instances is $\Omega(T^{(\beta+1)/(2\beta+1)})$ for any integer $\beta\ge 1$. This matches our upper bound for $\beta=2$ up to logarithmic factors. Furthermore, we validate the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
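As a quick arithmetic check of the exponents quoted in the abstract, the sketch below evaluates the lower-bound exponent $(\beta+1)/(2\beta+1)$ for a few small integers $\beta$. The function name `lower_bound_exponent` is a hypothetical helper introduced here for illustration; it is not part of the paper.

```python
from fractions import Fraction


def lower_bound_exponent(beta: int) -> Fraction:
    """Exponent of T in the Omega(T^((beta+1)/(2*beta+1))) lower bound.

    Hypothetical helper for illustration; the formula is taken from the abstract.
    """
    if beta < 1:
        raise ValueError("beta must be an integer >= 1")
    return Fraction(beta + 1, 2 * beta + 1)


if __name__ == "__main__":
    # Print the exact fraction and its decimal value for small beta.
    for beta in range(1, 6):
        exponent = lower_bound_exponent(beta)
        print(f"beta={beta}: exponent = {exponent} (about {float(exponent):.3f})")
```

For $\beta=1$ this gives $2/3$, the non-smooth rate, and for $\beta=2$ it gives $3/5$, the exponent of $T$ in the upper bound stated above. The exponent decreases toward $1/2$ as $\beta$ grows.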


Paper Structure (37 sections, 31 theorems, 101 equations, 7 figures, 2 algorithms)


Introduction
Our Contributions
Formulation
The Hölder Class and Smooth Non-stationary Bandits
The Regret
Related Work
Lower Bounds
Building a Bowl
Definition of the Family $\mathcal{F}_\beta$
The Main Lower Bound
Upper Bounds for the One-armed Setting
The Budgeted Exploration Policy
Non-smooth Case: $\beta=1$
A $T^{3/5}$ Upper Bound for $\beta=2$
Multi-Armed Setting
...and 22 more sections

Key Result

Proposition 4.1

For any fixed integer $\beta\geq 1$, there exists a family $\{g_\varepsilon\}$ of $(\beta-1)$-times continuously differentiable functions where $g_\varepsilon$ is defined on $[0,\varepsilon]$, with (i) vanishing derivatives: $g^{(j)}_\varepsilon(0)=g^{(j)}_\varepsilon(\varepsilon)=0$ for any $j=1,\

Figures (7)

Figure 1: Illustration of $g_\varepsilon$ for $\beta = 4$: $g^{(3)}$ is a "flock" of pyramid-shaped functions. The function $g^{(2)}(x)$ is defined as the integral of $g^{(3)}$ from $0$ to $x$. Similarly, $g^{(1)}(x)$ is the integral of $g^{(2)}$ from $0$ to $x$. The key property is that every derivative of order lower than $3$ vanishes at the boundary points $0$ and $4w$.
Figure 2: Construction of the family $\mathcal{F}_\beta$, illustrated for $\beta=2$. Shown are "snapshots" of the curves on the two epochs $[x_j, x_{j+1}]$ and $[x_{j+1}, x_{j+2}]$. For any combination of red and blue curves, the transition at every endpoint is smooth, since both the red and the blue curves have zero derivative at every $x_j$.
Figure 3: Log-log regret plot on synthetic data. To visualize how the regret $R$ of each policy (BE-NS, BE-S and Rexp3) scales with the length $T$ of the time horizon, we present a log-log (base 10) plot. Each data point represents the regret of a policy, averaged across $100$ randomly generated sinusoidal instances. We applied linear regression to the data points corresponding to each policy and obtained three fitted lines, whose expressions are provided in the figure. The slopes of these lines align closely with the theoretical values. In particular, the regret of BE-S grows considerably more slowly than that of the benchmarks.
Figure 4: Modeling non-stationarity in the CTR using Yahoo! data. We first employ a rolling window average method on the Yahoo! user-click data to obtain a non-smooth function that represents the variations of CTR in time, as illustrated in the left subfigure. In the second part of our experiment, we smooth these functions using local regression, resulting in a mean reward sequence of length $8.64\times 10^7$, where each round corresponds to a second; see the right subfigure.
Figure 5: Visualization of the experimental results in the counterfactual setting.
...and 2 more figures
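The slope-fitting step described in the caption of Figure 3 can be sketched as follows: take base-10 logarithms of the horizons and the averaged regrets, then fit a line by ordinary least squares, whose slope estimates the exponent of $T$. This is a minimal sketch under stated assumptions. The helper name `fit_loglog_slope` and the regret values (generated from an exact power law $2\,T^{0.6}$) are hypothetical and are not the paper's code or data.

```python
import math


def fit_loglog_slope(horizons, regrets):
    """Least-squares fit of log10(regret) = slope * log10(T) + intercept.

    Hypothetical helper for illustration; returns (slope, intercept).
    """
    xs = [math.log10(t) for t in horizons]
    ys = [math.log10(r) for r in regrets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard closed-form solution for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up regrets following an exact power law, NOT the paper's measurements.
    horizons = [10**3, 10**4, 10**5, 10**6]
    regrets = [2.0 * t**0.6 for t in horizons]
    slope, intercept = fit_loglog_slope(horizons, regrets)
    print(f"estimated exponent: {slope:.3f}, intercept: {intercept:.3f}")
```

On real experimental data the points would not lie exactly on a line, so the fitted slope would only approximate the theoretical exponent.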

Theorems & Definitions (48)

Definition 2.1: Hölder Class
Definition 2.2: Smooth Non-stationary Instance
Definition 2.3: The Hölder Family
Definition 2.4: Regret
Proposition 4.1: Side of the Bowl
Definition 4.2: Construction of a Bowl
Definition 4.3: The Family $\mathcal{F}_\beta$
Theorem 4.4: Main Lower bound
Lemma 4.5: Likely to Select a Wrong Arm
Proposition 5.1: Generic Upper Bound, $\beta=1,k=1$
...and 38 more
