Planning and Learning in Risk-Aware Restless Multi-Arm Bandit Problem

Nima Akbarzadeh, Yossiri Adulyasak, Erick Delage

TL;DR

This work generalizes the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness, establishes indexability conditions for the risk-aware objective, and provides a solution based on the Whittle index.

Abstract

In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on the Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach, and we show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments in the contexts of machine replacement and patient scheduling applications under both planning and learning setups.
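The learning step mentioned in the abstract, Thompson sampling over unknown transition probabilities, can be illustrated with a minimal sketch. It assumes a Dirichlet posterior over each row of an arm's transition kernel and a count array `counts[a, x, y]` of observed transitions; the function name `sample_transition_model`, the symmetric prior, and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_transition_model(counts, prior=1.0, rng=None):
    """Draw one transition kernel per (action, state) row from a Dirichlet posterior.

    counts[a, x, y] is the number of observed transitions x -> y under action a.
    The symmetric prior pseudo-count is an illustrative assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.asarray(counts, dtype=float) + prior  # posterior Dirichlet parameters
    # Independent Gamma(alpha, 1) draws, normalized over the last axis,
    # are distributed as Dirichlet(alpha) for each (a, x) row.
    g = rng.gamma(shape=alpha)
    return g / g.sum(axis=-1, keepdims=True)

# Example: one arm with 2 actions and 3 states, no data except
# 50 observed transitions from state 0 to state 2 under action 1.
counts = np.zeros((2, 3, 3))
counts[1, 0, 2] = 50
P_sampled = sample_transition_model(counts, rng=np.random.default_rng(0))
```

In an episodic scheme of this kind, a kernel would be sampled at the start of each episode, a policy computed for the sampled model, and the counts updated with the transitions observed during the episode.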

Paper Structure

This paper contains 22 sections, 9 theorems, 42 equations, 3 figures, 2 tables, 3 algorithms.

Key Result

Proposition 2

Let $\tilde{\pi}_\lambda^{i*}$ be an optimal Markovian policy for the augmented arm risk-neutral MDP. Then one can construct from it an optimal policy $\bar{\pi}_\lambda^{i*}$ for the relaxation of the risk-aware problem. Namely, $\max_{\bar{\pi}^i\in\bar{\Pi}_H} \bar{D}^i_{\lambda, x^i_0}(\bar{\pi}^i) = \bar{D}^i_{\lambda, x^i_0}(\bar{\pi}_\lambda^{i*})$, where $\bar{\Pi}_H$ is the set of all history-dependent policies.
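The construction of the augmented arm MDP is not reproduced in this summary. As a rough illustration of why a Markovian policy on an augmented state can suffice, the sketch below assumes that the augmentation tracks the reward accumulated so far, that rewards are nonnegative integers, that risk is expressed through a utility of the total reward, and that $\lambda$ is charged per activation. All names (`solve_augmented_arm`, `P`, `r`, `utility`) and these modeling choices are hypothetical, not the paper's definitions.

```python
import numpy as np

def solve_augmented_arm(P, r, T, lam, utility):
    """Backward induction on the augmented state (x, s), s = reward accumulated so far.

    P[a, x, y]: transition probabilities; r[x, a]: nonnegative integer rewards;
    lam: charge per activation (action 1); utility: maps total reward to a score.
    Returns V[t, x, s] and a greedy Markov policy pi[t, x, s] on the augmented state.
    """
    n_actions, n_states, _ = P.shape
    r_max = int(r.max())
    s_max = r_max * T                               # largest reachable total reward
    V = np.zeros((T + 1, n_states, s_max + 1))
    pi = np.zeros((T, n_states, s_max + 1), dtype=int)
    # Terminal value: utility of the accumulated reward, identical for every state x.
    V[T] = np.array([utility(s) for s in range(s_max + 1)])[None, :]
    for t in range(T - 1, -1, -1):
        for x in range(n_states):
            for s in range(r_max * t + 1):          # values of s reachable at time t
                q = np.empty(n_actions)
                for a in range(n_actions):
                    s_next = s + int(r[x, a])       # accumulated reward after acting
                    q[a] = P[a, x] @ V[t + 1, :, s_next] - lam * a
                pi[t, x, s] = int(np.argmax(q))
                V[t, x, s] = q[pi[t, x, s]]
    return V, pi

# Example: 2 static states, reward 1 only when active, target total reward of 2.
P = np.stack([np.eye(2), np.eye(2)])
r = np.array([[0, 1], [0, 1]])
V, pi = solve_augmented_arm(P, r, T=2, lam=0.1, utility=lambda s: float(s >= 2))
```

Because the accumulated reward is part of the state, the terminal utility depends only on the final augmented state, so the problem is an ordinary expected-value (risk-neutral) finite-horizon MDP on the larger state space.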

Figures (3)

  • Figure 1: Sample plots of the three utility functions.
  • Figure 2: Figure (a) shows the distribution of relative improvements in the objective function achieved by our proposed policy compared to the risk-neutral one in $6804$ different setups. This histogram is limited to the range of values up to $100$. Figure (b) illustrates the distribution of total rewards under both risk-aware and risk-free policies for one of the arms with the setup $T=5$, $|{\cal X}|=5$, $N=25$, $M=7$, $\alpha=1$, $\tau=0.5$. The red line is set at the target $\tau$.
  • Figure 3: The plots show $\mathcal{R}(k)$ of RAWIP on the left and $\mathcal{R}(k)/K$ on the right for three different setups when the utility function is $\alpha = 1$. For all these experiments, $T=5$, $M = 1$, and $\tau = 0.5$. These plots are averaged over $100$ sample paths. The transition model for panels (a) and (b) follows model 3 of le2016structural, while the models for panels (c) and (d) and for panels (e) and (f) follow $\mathcal{P}$.

Theorems & Definitions (14)

  • Definition 1: Indexability and Whittle index
  • Proposition 2
  • Lemma 4
  • Theorem 5
  • Remark 6
  • Lemma 7
  • Theorem 8
  • Remark 9
  • Proposition 10
  • Lemma 11: Lemma 1 of russo2014learning
  • ...and 4 more