
Rising Rested MAB with Linear Drift

Omer Amichay, Yishay Mansour

TL;DR

This work studies a non-stationary rising rested MAB where each arm's reward mean grows linearly with the number of pulls, formalized as $\mu_i(n)=L_i n+b_i$ with $L_i\ge0$. The authors prove a tight regret bound of $\tilde{\Theta}(T^{4/5}K^{3/5})$ and provide both upper and lower bounds, including instance-dependent refinements. They introduce the R-ed-EE explore-exploit algorithm, achieving $O\left(T^{4/5}(\Phi K)^{3/5}\ln(\Phi KT)^{1/5}\right)$ regret, and two instance-dependent algorithms, R-ed-AE and HR-re-AE, with bounds that adapt to problem parameters; they also establish a near-matching lower bound $\Omega(K^{3/5}T^{4/5})$. An important takeaway is that, unlike stationary stochastic MAB, the rising linear-drift setting incurs substantially higher regret, and the horizon-unknown case incurs linear regret even under favorable conditions. The results offer a principled understanding of exploration-exploitation in changing environments and pave the way for further study of hybrid rising/rotating drift models.

Abstract

We consider a non-stationary multi-armed bandit (MAB) problem in which the expected reward of each action is a linear function of the number of times the action has been executed. Our main result is a tight regret bound of $\tilde{\Theta}(T^{4/5}K^{3/5})$, established by providing both upper and lower bounds. We extend our results to derive instance-dependent regret bounds, which depend on the unknown parametrization of the linear drift of the rewards.
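
As a minimal illustration of the reward model, the sketch below assumes the mean reward of the $n$-th pull of arm $i$ is $\mu_i(n)=L_i n+b_i$ with pulls counted from $n=1$. The indexing convention, the function names, and the example parameters are assumptions made for illustration. It compares the expected cumulative reward of always playing the best single arm with that of a naive round-robin policy; it is not one of the paper's algorithms.

```python
def mean_reward(L, b, n):
    """Expected reward of the n-th pull (n = 1, 2, ...) of an arm with slope L and intercept b."""
    return L * n + b


def cumulative_mean(L, b, n_pulls):
    """Expected total reward of n_pulls pulls of one arm: sum_{n=1}^{n_pulls} (L*n + b)."""
    return L * n_pulls * (n_pulls + 1) / 2 + b * n_pulls


def best_fixed_arm_value(arms, T):
    """Expected total reward of the best policy that plays a single arm for all T rounds."""
    return max(cumulative_mean(L, b, T) for L, b in arms)


def round_robin_value(arms, T):
    """Expected total reward when the T pulls are spread as evenly as possible over the arms."""
    K = len(arms)
    total = 0.0
    for i, (L, b) in enumerate(arms):
        n_i = T // K + (1 if i < T % K else 0)  # pulls given to arm i
        total += cumulative_mean(L, b, n_i)
    return total


# Hypothetical two-arm instance: (slope L_i, intercept b_i), chosen so means stay in [0, 1].
arms = [(0.0005, 0.1), (0.001, 0.0)]
T = 1000
regret = best_fixed_arm_value(arms, T) - round_robin_value(arms, T)
print(f"expected regret of round-robin over T={T}: {regret:.3f}")
```

Because the reward of an arm depends on how often that arm itself was pulled (the rested setting), splitting pulls across arms forgoes the growth each arm would have accumulated, which is why the round-robin policy falls well behind here.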

Paper Structure

This paper contains 44 sections, 22 theorems, 62 equations, 1 figure, 1 table, 3 algorithms.

Key Result

Corollary 2

For Rising Rested MAB with Linear Drift, the dynamic regret is equal to the static regret. Namely, the optimal policy always plays arm $i^*$.
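
One way to see why a single arm is optimal: in the rested setting the expected total reward depends only on how many times each arm is pulled, and with $L_i\ge0$ the cumulative mean of arm $i$ after $n_i$ pulls, $\sum_{n=1}^{n_i}(L_i n+b_i)=L_i n_i(n_i+1)/2+b_i n_i$, is convex in $n_i$. A sum of such terms over allocations with $\sum_i n_i=T$ is therefore maximized by giving all $T$ pulls to one arm. The sketch below checks this by brute force on a small hypothetical instance; indexing pulls from 1, the helper names, and the parameters are assumptions, and this is an illustration rather than the paper's proof.

```python
from itertools import product


def cumulative_mean(L, b, n_pulls):
    """Expected total reward of n_pulls pulls of one arm: sum_{n=1}^{n_pulls} (L*n + b)."""
    return L * n_pulls * (n_pulls + 1) / 2 + b * n_pulls


def best_allocation(arms, T):
    """Enumerate every split of T pulls over the arms and return the best (counts, value)."""
    K = len(arms)
    best_value, best_counts = float("-inf"), None
    for head in product(range(T + 1), repeat=K - 1):
        last = T - sum(head)
        if last < 0:
            continue  # not a valid split of T pulls
        counts = head + (last,)
        value = sum(cumulative_mean(L, b, n) for (L, b), n in zip(arms, counts))
        if value > best_value:
            best_value, best_counts = value, counts
    return best_counts, best_value


# Hypothetical instance: (slope L_i, intercept b_i) for three arms.
arms = [(0.01, 0.30), (0.03, 0.05), (0.00, 0.50)]
counts, value = best_allocation(arms, T=30)
print(counts, round(value, 2))
```

On this instance the best allocation puts all 30 pulls on the second arm, even though it has the lowest intercept: its larger slope makes its cumulative reward the highest over the full horizon.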

Figures (1)

  • Figure 1: Sample figure caption.

Theorems & Definitions (51)

  • Remark 1
  • Corollary 2
  • Definition 3
  • Lemma 4
  • Theorem 5
  • proof
  • Definition 6
  • Definition 7
  • Lemma 8
  • Lemma 9
  • ...and 41 more