A Diffusion Analysis of Policy Gradient for Stochastic Bandits

Tor Lattimore

Abstract

We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
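
To make the rate concrete, one can substitute the largest learning rate that the upper bound permits. This is only a sketch: the constant $c > 0$ is a hypothetical placeholder for the unspecified constant hidden in the $O(\cdot)$ condition on the learning rate.

$$
\eta = \frac{c\,\Delta^2}{\log(n)} \quad\Longrightarrow\quad \frac{k \log(k) \log(n)}{\eta} = \frac{k \log(k) \log(n)^2}{c\,\Delta^2},
$$

so under this choice the stated bound reads $O(k \log(k) \log(n)^2 / \Delta^2)$.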

Paper Structure (28 sections, 11 theorems, 64 equations, 2 figures, 2 algorithms)

This paper contains 28 sections, 11 theorems, 64 equations, 2 figures, 2 algorithms.

Introduction
Basic notation
Bandit notation
Contribution
Related work
Policy gradient in continuous time
Elementary properties
Upper bounds
Lower bound
Lower bound construction
Step 1: Dynamics and intuition
Step 2: Formal details
Step 4: The calculations
Discussion
Continuous time vs discrete time
...and 13 more sections

Key Result

Lemma 1

$\sum_{a=1}^k \theta_{t,a} = 0$ almost surely.
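
The lemma concerns the continuous-time diffusion, but a discrete-time analogue is easy to check numerically. The sketch below assumes the standard softmax policy gradient update without a baseline, $\theta_{t+1,a} = \theta_{t,a} + \eta X_t (1\{A_t = a\} - \pi_{t,a})$, started from $\theta_0 = 0$ with unit-variance Gaussian reward noise. The update rule, the noise model, the initialisation and the name `run_policy_gradient` are all assumptions made for illustration and may differ from the paper's algorithm. Since $\sum_a (1\{A_t = a\} - \pi_{t,a}) = 0$, every update leaves the sum of the logits unchanged.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    weights = [math.exp(x - m) for x in theta]
    total = sum(weights)
    return [w / total for w in weights]


def run_policy_gradient(means, eta, n, seed=0):
    """Run an assumed discrete-time softmax policy gradient for n rounds.

    Returns the final logits and the largest |sum of logits| seen,
    which should be zero up to floating-point rounding.
    """
    rng = random.Random(seed)
    k = len(means)
    theta = [0.0] * k  # assumed initialisation: all logits zero
    max_abs_sum = 0.0
    for _ in range(n):
        pi = softmax(theta)
        action = rng.choices(range(k), weights=pi)[0]
        # Assumed noise model: unit-variance Gaussian around the arm mean.
        reward = means[action] + rng.gauss(0.0, 1.0)
        for a in range(k):
            indicator = 1.0 if a == action else 0.0
            # The increments (indicator - pi[a]) sum to zero over the arms.
            theta[a] += eta * reward * (indicator - pi[a])
        max_abs_sum = max(max_abs_sum, abs(sum(theta)))
    return theta, max_abs_sum


if __name__ == "__main__":
    theta, drift = run_policy_gradient([0.5, 0.3, 0.1], eta=0.05, n=2000)
    print("final logits:", theta)
    print("largest |sum of logits|:", drift)
```

The reported drift is only floating-point rounding error; in exact arithmetic the sum stays at its initial value of zero under these assumptions.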

Figures (2)

Figure 1: The plot shows 40 trajectories of $\pi_{t,1}$ produced by \ref{alg:pg} on the instance of \ref{thm:lower} for 6 different learning rates with $\Delta_{2} = 0.002$.
Figure 2: The figure shows the results for the same experiment as in \ref{fig:lower} but with $k = 3$, suggesting that the logarithmic number of arms used in the lower bound is needed.

Theorems & Definitions (24)

Lemma 1: conservation
proof
Lemma 2
Lemma 3
Proposition 4
proof
Remark 5
Theorem 6
Lemma 7
Lemma 8
...and 14 more
