Whittle Index Learning Algorithms for Restless Bandits with Constant Stepsizes

Vishesh Mittal; Rahul Meshram; Surya Prakash

TL;DR

The Whittle index learning algorithm with Q-learning for restless multi-armed bandits is studied, and it is illustrated that index learning with Q-learning, DQN, and function approximation learns the Whittle index.

Abstract

We study the Whittle index learning algorithm for restless multi-armed bandits, focusing on index learning with Q-learning. We first present the Q-learning algorithm with constant stepsizes under three exploration policies: epsilon-greedy, softmax, and epsilon-softmax. We then extend the study of Q-learning to index learning for a single-armed restless bandit. The index learning algorithm is a two-timescale variant of stochastic approximation: on the slower timescale we update the index, and on the faster timescale we update Q-learning assuming a fixed index value. The Q-learning updates are asynchronous. We study the constant-stepsize two-timescale stochastic approximation algorithm and provide an analysis of it for index learning. Further, we study index learning with deep Q-network (DQN) learning and with linear function approximation using a state-aggregation method. We illustrate the performance of our algorithms with numerical examples, which show that index learning with Q-learning, DQN, and function approximation learns the Whittle index.
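The two-timescale scheme described in the abstract can be sketched as follows. This is a minimal illustration and not the paper's exact algorithm: it assumes a discounted formulation in which a subsidy `lam` is paid for the passive action, and the index for a reference state is the subsidy at which both actions have equal Q-value there. The function name `learn_whittle_index`, its parameters, and the default stepsizes are hypothetical choices made for this example.

```python
import random

def learn_whittle_index(P, r, ref_state, gamma=0.9, alpha=0.2, beta=0.0002,
                        eps=0.5, n_steps=200_000, seed=0):
    """Two-timescale index learning sketch for one restless arm.

    P[a][s] is the list of next-state probabilities under action a
    (0 = passive, 1 = active); r[s][a] is the reward.
    Returns the learned subsidy for ref_state and the Q-table.
    Assumed formulation: discounted reward with a subsidy for passivity.
    """
    rng = random.Random(seed)
    K = len(r)
    Q = [[0.0, 0.0] for _ in range(K)]
    lam = 0.0
    s = rng.randrange(K)
    for _ in range(n_steps):
        # epsilon-greedy exploration on the current Q estimate
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = 1 if Q[s][1] > Q[s][0] else 0
        s_next = rng.choices(range(K), weights=P[a][s])[0]
        # Faster timescale (constant stepsize alpha): asynchronous Q-learning
        # update of the visited pair only, with lam treated as fixed.
        subsidy = lam if a == 0 else 0.0
        target = r[s][a] + subsidy + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        # Slower timescale (constant stepsize beta << alpha): move lam toward
        # indifference between active and passive at the reference state.
        lam += beta * (Q[ref_state][1] - Q[ref_state][0])
        s = s_next
    return lam, Q
```

One way to sanity-check such a sketch is a toy arm whose transitions do not depend on the action: under the assumed formulation the future value cancels in the difference of the two Q-values, so the index at state `s` reduces to `r[s][1] - r[s][0]`, and the learned `lam` should settle near that value.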

Paper Structure (40 sections, 1 theorem, 36 equations, 20 figures, 2 algorithms)

This paper contains 40 sections, 1 theorem, 36 equations, 20 figures, 2 algorithms.

Introduction
Related work
Our contributions
Preliminaries on MDP and Q Learning
Q-Learning with different exploration schemes
$\epsilon$-greedy exploration scheme
Softmax exploration scheme
$\epsilon$-softmax exploration scheme
A single armed restless bandit model and Index Learning
Two-timescale index learning algorithm
Index Learning using DQN
Numerical Examples
Examples for MDP Models
Example of one-step random walk with $K = 25$
Example of one-step random walk with $K = 25$ and re-initialization
...and 25 more sections

Key Result

Lemma 1

Suppose Then $Q_n(s,a) \rightarrow Q^*(s,a)$ for all $(s,a)$ almost surely.
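The hypotheses of Lemma 1 are not reproduced above, so the following is only an illustrative sketch of the kind of iteration it concerns: tabular, asynchronous Q-learning with a constant stepsize under the three exploration schemes named in the abstract. The function names, the default parameters, and the precise form of epsilon-softmax (a softmax draw with probability `eps`, the greedy action otherwise) are assumptions made for this example. The dynamics are deterministic (`next_state[s][a]`), a simplification under which a constant stepsize can reach $Q^*$ exactly.

```python
import math
import random

def choose_action(q_row, scheme, rng, eps=0.2, temp=1.0):
    """Pick an action index from one row of the Q-table."""
    n = len(q_row)
    greedy = max(range(n), key=lambda a: q_row[a])
    if scheme == "eps_greedy":
        # uniform random action with probability eps, greedy otherwise
        return rng.randrange(n) if rng.random() < eps else greedy
    # Softmax weights, shifted by the row maximum for numerical stability.
    m = max(q_row)
    weights = [math.exp((q - m) / temp) for q in q_row]
    if scheme == "softmax":
        return rng.choices(range(n), weights=weights)[0]
    if scheme == "eps_softmax":
        # Assumed definition: softmax draw with probability eps, else greedy.
        if rng.random() < eps:
            return rng.choices(range(n), weights=weights)[0]
        return greedy
    raise ValueError(f"unknown scheme: {scheme}")

def q_learning(next_state, reward, scheme, gamma=0.5, alpha=0.1,
               n_steps=100_000, seed=0):
    """Asynchronous Q-learning with constant stepsize alpha on a
    deterministic MDP given by next_state[s][a] and reward[s][a]."""
    rng = random.Random(seed)
    n_states, n_actions = len(reward), len(reward[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        a = choose_action(Q[s], scheme, rng)
        s2 = next_state[s][a]
        # Only the visited (state, action) pair is updated at each step.
        Q[s][a] += alpha * (reward[s][a] + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```

For instance, on a hypothetical two-state MDP where action 0 stays and action 1 switches state, with `reward = [[0.2, 0.0], [1.0, 0.5]]` and `gamma = 0.5`, solving the Bellman equations by hand gives $Q^* = [[0.7, 1.0], [2.0, 1.0]]$, which each of the three schemes should approach.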

Figures (20)

Figure 1: Q-learning: Example with one-step random walk and number of states $K=25$, without and with re-initialization
Figure 2: Index learning using Q-learning: Example with one-step random walk with $K=25$
Figure 3: Index learning with DQN algorithm: Example with one-step random walk with $K=5$ and re-initialization
Figure 4: Linear function approximation: Example of one-step random walk with $K = 500$ and re-initialization
Figure 5: Q-learning: Example with circular dynamic model
...and 15 more figures

Theorems & Definitions (1)

Lemma 1
