Rising Rested Bandits: Lower Bounds and Efficient Algorithms

Marco Fiandri, Alberto Maria Metelli, Francesco Trovò

TL;DR

An algorithm is designed for rested bandits in which the arms' expected reward is monotonically non-decreasing and concave. Its regret bound depends on the properties of the instance and, under certain circumstances, is of order $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$.

Abstract

This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. $\textit{arm}$). We study a particular case of the rested bandits in which the arms' expected reward is monotonically non-decreasing and concave. We study the inherent sample complexity of the regret minimization problem by deriving suitable regret lower bounds. Then, we design an algorithm for the rested case, $\textit{R-ed-UCB}$, providing a regret bound that depends on the properties of the instance and, under certain circumstances, is of order $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.

Paper Structure

This paper contains 40 sections, 30 theorems, 149 equations, 14 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Let $\pi^\star_{\bm{\mu},T}$ be the oracle constant policy. Then, $\pi^\star_{\bm{\mu},T}$ is optimal for the rested non-decreasing bandits (i.e., under Assumption \ref{ass:incr}). We denote by $i^\star(T) := \pi^\star_{\bm{\mu},T}(t)$ the optimal constant arm.
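To make the statement concrete, here is a minimal sketch of how an optimal constant arm could be computed when the expected reward curves are known. It assumes that the oracle constant policy always plays the arm maximizing the cumulative expected reward $\sum_{n=1}^{T} \mu_i(n)$, where $\mu_i(n)$ is the expected reward of arm $i$ at its $n$-th pull. The notation $\mu_i(n)$, the function name `oracle_constant_arm`, and the two example arms are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: maps the pull count n (1-based) to the expected reward.
RewardCurve = Callable[[int], float]


def oracle_constant_arm(
    mus: Sequence[RewardCurve], horizon: int
) -> Tuple[int, List[float]]:
    """Return the arm maximizing sum_{n=1}^{T} mu_i(n) and all cumulative totals.

    In the rested setting, an arm's reward depends only on how many times
    that arm has been pulled, so playing arm i for all T rounds collects
    mu_i(1) + ... + mu_i(T) in expectation.
    """
    totals = [sum(mu(n) for n in range(1, horizon + 1)) for mu in mus]
    best = max(range(len(mus)), key=lambda i: totals[i])
    return best, totals


# Illustrative 2-armed instance (an assumption, not from the paper).
# Both curves are non-decreasing and concave in n.
mus: List[RewardCurve] = [
    lambda n: 0.5,                       # flat arm
    lambda n: 1.0 - math.exp(-0.1 * n),  # rising arm with a higher asymptote
]

if __name__ == "__main__":
    for horizon in (5, 100):
        arm, totals = oracle_constant_arm(mus, horizon)
        print(f"T={horizon}: optimal constant arm={arm}, totals={totals}")
```

In this toy instance the flat arm is the better constant choice for a short horizon, while the rising arm wins for a long one, which illustrates why the optimal constant arm $i^\star(T)$ carries a dependence on $T$.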

Figures (14)

  • Figure 1: The two 2-armed instances (A on the left, B on the right) of rested bandits with non-decreasing expected rewards used in the proof of Theorem \ref{thr:nonLearnable}.
  • Figure 2: The two 2-armed instances (A on the left, B on the right) of rested bandits with non-decreasing expected rewards used in the proof of Theorem \ref{thr:nonLearnable2}.
  • Figure 3: The two 2-armed instances (A on the left, B on the right) of rested bandits with non-decreasing expected rewards used in the proof of Corollary \ref{thr:Learnablebeta}.
  • Figure 4: The two 2-armed instances (A on the left, B on the right) of rested bandits with non-decreasing expected rewards used in the proof of Theorem \ref{thr:nonLearnable3}.
  • Figure 5: Graphical representation of the estimator construction $\overline{\mu}_i^{\text{R-ed}}(t)$ for the rested deterministic setting.
  • ...and 9 more figures

Theorems & Definitions (49)

  • Theorem 3.1: heidari2016tight
  • Lemma 1: Regret Decomposition
  • Theorem 4.1: Non-Learnability under Assumption \ref{ass:incr}
  • Theorem 4.2: Non-Learnability under Assumptions \ref{ass:incr} and \ref{ass:decrDeriv}
  • Corollary 1: Learnability under Assumptions \ref{ass:incr} and \ref{ass:decrDeriv}
  • Theorem 4.3: Lower Bound under Assumptions \ref{ass:incr} and \ref{ass:decrDeriv}
  • Theorem 4.4: $\Upsilon_{\bm{\mu}}$-Dependent Regret Lower Bound under Assumptions \ref{ass:incr} and \ref{ass:decrDeriv}
  • Theorem 5.1
  • Theorem 5.2
  • Theorem A.1: heidari2016tight
  • ...and 39 more