Faster Q-Learning Algorithms for Restless Bandits

Parvish Kakarapalli; Devendra Kayande; Rahul Meshram

TL;DR

We study Whittle index learning for restless multi-armed bandits (RMAB). A numerical example illustrates that Q-learning with a UCB exploration policy converges faster than with $\epsilon$-greedy, and that PhaseQL with UCB converges fastest.

Abstract

We study the Whittle index learning algorithm for restless multi-armed bandits (RMAB). We first present the Q-learning algorithm and its variants: speedy Q-learning (SQL), generalized speedy Q-learning (GSQL), and phase Q-learning (PhaseQL). We also discuss two exploration policies, $\epsilon$-greedy and upper confidence bound (UCB), and extend the study of Q-learning and its variants to the UCB policy. A numerical example illustrates that Q-learning with the UCB exploration policy converges faster, and that PhaseQL with UCB has the fastest convergence rate. We next extend the study of Q-learning variants to index learning for RMAB. The index learning algorithm is a two-timescale variant of stochastic approximation: on the slower timescale we update the index, and on the faster timescale we update the Q-values assuming a fixed index value. We study this two-timescale stochastic approximation algorithm with constant stepsizes. We describe the performance of our algorithms using a numerical example. It illustrates that index learning using Q-learning with UCB converges faster than with $\epsilon$-greedy. Further, PhaseQL (with either UCB or $\epsilon$-greedy) converges faster than the other Q-learning algorithms.
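The two-timescale scheme described above can be sketched for a single arm as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes a discounted reward criterion, a subsidy `lam` paid for the passive action, a count-based UCB bonus, and the commonly used index update that drives `Q[s_hat, 1] - Q[s_hat, 0]` toward zero. The function name `learn_index`, its parameters, and the two-state arm `P_demo`, `R_demo` are hypothetical and chosen only for illustration.

```python
import numpy as np

def learn_index(P, R, s_hat, gamma=0.9, alpha=0.1, beta=0.01,
                c=1.0, n_steps=5000, seed=0):
    """Two-timescale sketch: Q-learning (fast) plus index update (slow).

    P[a, s, s2] : transition probabilities (a=0 passive, a=1 active)
    R[s, a]     : rewards
    s_hat       : reference state whose index is being learned
    alpha, beta : constant stepsizes, with beta much smaller than alpha
    """
    rng = np.random.default_rng(seed)
    n_states = R.shape[0]
    Q = np.zeros((n_states, 2))
    counts = np.zeros((n_states, 2))
    lam = 0.0          # current index (subsidy) estimate, assumed start at 0
    s = s_hat
    for t in range(1, n_steps + 1):
        # UCB exploration: greedy in Q plus a bonus that shrinks with visits.
        bonus = c * np.sqrt(np.log(t + 1) / (counts[s] + 1.0))
        a = int(np.argmax(Q[s] + bonus))
        counts[s, a] += 1

        # Sample the next state from the (assumed known to the simulator) model.
        s_next = int(rng.choice(n_states, p=P[a, s]))

        # Fast timescale: Q-learning step with the subsidy held at lam.
        reward = R[s, a] + lam * (1 - a)
        target = reward + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

        # Slow timescale: move lam toward indifference between the
        # active and passive actions in the reference state.
        lam += beta * (Q[s_hat, 1] - Q[s_hat, 0])
        s = s_next
    return lam, Q, counts

# Hypothetical two-state arm used only to exercise the sketch.
P_demo = np.array([[[0.9, 0.1], [0.4, 0.6]],    # passive transitions
                   [[0.5, 0.5], [0.2, 0.8]]])   # active transitions
R_demo = np.array([[0.0, 1.0],
                   [0.0, 2.0]])

lam_hat, Q_hat, visit_counts = learn_index(P_demo, R_demo, s_hat=0)
print("index estimate for state 0:", lam_hat)
```

Choosing `beta` much smaller than `alpha` is what separates the timescales: the Q-values see an almost constant `lam` between index updates. Swapping the UCB line for an $\epsilon$-greedy rule changes only the action selection and leaves both updates untouched.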

Paper Structure (19 sections, 32 equations, 3 figures, 1 algorithm)

Introduction
Related Work
Our contributions
Preliminaries on Q-learning and its variants
Speedy Q Learning (SQL)
Generalized speedy Q Learning (GSQL)
Phase Q Learning (PhaseQL)
Action selection policy: $\epsilon$-greedy and UCB policy
Numerical example with no structure on transition model
Restless Bandits Formulation
Single-armed restless bandit problem
Two-timescale index learning algorithm
Index learning with Q-RL
Index learning with SQL
Index learning with GSQL
...and 4 more sections

Figures (3)

Figure 1: Performance of different Q-learning algorithms
Figure 2: Performance of Phase Q-learning with $\epsilon$-greedy and UCB algorithms
Figure 3: Index learning performance with different Q-learning algorithms
