Faster Q-Learning Algorithms for Restless Bandits

Parvish Kakarapalli, Devendra Kayande, Rahul Meshram

TL;DR

This paper studies the Whittle index learning algorithm for restless multi-armed bandits (RMAB). A numerical example illustrates that Q-learning with a UCB exploration policy converges faster than with $ε$-greedy, and that PhaseQL with UCB converges fastest.

Abstract

We study the Whittle index learning algorithm for restless multi-armed bandits (RMAB). We first present the Q-learning algorithm and its variants: speedy Q-learning (SQL), generalized speedy Q-learning (GSQL), and phase Q-learning (PhaseQL). We also discuss two exploration policies: $ε$-greedy and upper confidence bound (UCB). We extend the study of Q-learning and its variants to the UCB policy. Using a numerical example, we illustrate that Q-learning with the UCB exploration policy converges faster, and that PhaseQL with UCB has the fastest convergence rate. We next extend the study of Q-learning variants to index learning for RMAB. The index learning algorithm is a two-timescale variant of stochastic approximation: on the slower timescale we update the index learning scheme, and on the faster timescale we update Q-learning assuming a fixed index value. We study a constant-stepsize two-timescale stochastic approximation algorithm. We describe the performance of our algorithms using a numerical example. It illustrates that index learning using Q-learning with UCB converges faster than with $ε$-greedy. Further, PhaseQL (with either UCB or $ε$-greedy) converges faster than the other Q-learning algorithms.
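To make the two exploration policies named in the abstract concrete, the following is a minimal sketch of tabular Q-learning with either $ε$-greedy or UCB action selection. It is not the authors' implementation: the bonus form `c * sqrt(log t / N(s, a))` is one common UCB choice and is assumed here, and `toy_step`, the two-state environment, and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def ucb(q_row, counts_row, t, c=2.0):
    """Pick the action maximizing Q plus an exploration bonus (assumed form).

    Actions never tried in this state are selected first.
    """
    for a, n in enumerate(counts_row):
        if n == 0:
            return a
    return max(
        range(len(q_row)),
        key=lambda a: q_row[a] + c * math.sqrt(math.log(t) / counts_row[a]),
    )


def toy_step(s, a, rng):
    """Hypothetical environment: action 1 pays reward 1, next state is uniform."""
    reward = 1.0 if a == 1 else 0.0
    return reward, rng.randrange(2)


def q_learning_ucb(step, n_states, n_actions, n_iters=5000,
                   alpha=0.1, gamma=0.9, c=2.0, seed=0):
    """Tabular Q-learning with constant stepsize alpha and UCB exploration."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    N = [[0] * n_actions for _ in range(n_states)]  # visit counts per (s, a)
    s = 0
    for t in range(1, n_iters + 1):
        a = ucb(Q[s], N[s], t, c)
        N[s][a] += 1
        r, s_next = step(s, a, rng)
        # Standard Q-learning update toward the one-step bootstrapped target.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
    return Q


if __name__ == "__main__":
    Q = q_learning_ucb(toy_step, n_states=2, n_actions=2)
    for s, row in enumerate(Q):
        print(s, [round(v, 2) for v in row])
```

Swapping `ucb(...)` for `epsilon_greedy(Q[s], epsilon, rng)` inside the loop gives the $ε$-greedy variant. The only difference between the two is how the action is chosen, so the sketch isolates the exploration policy from the Q-update itself.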

Paper Structure (19 sections, 32 equations, 3 figures, 1 algorithm)

Figures (3)

  • Figure 1: Performance of different Q-learning algorithms
  • Figure 2: Performance of Phase Q-learning with $\epsilon$-greedy and UCB algorithms
  • Figure 3: Index learning performance with different Q-learning algorithms