A Diffusion Analysis of Policy Gradient for Stochastic Bandits

Tor Lattimore

Abstract

We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.

Paper Structure (28 sections, 11 theorems, 64 equations, 2 figures, 2 algorithms)

Key Result

Lemma 1

$\sum_{a=1}^k \theta_{t,a} = 0$ almost surely.
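As a hedged illustration of why such a conservation property is plausible, the sketch below runs a discrete-time softmax policy gradient update and checks that the logits keep summing to zero. It is not the paper's diffusion or its exact algorithm: the update rule $\theta_a \leftarrow \theta_a + \eta\, r\, (\mathbf{1}\{A = a\} - \pi_a)$, the zero initialisation, the unit-variance Gaussian reward noise, and the names `softmax` and `run_policy_gradient` are all assumptions made for this example. The increments $\mathbf{1}\{A = a\} - \pi_a$ sum to zero over the arms, so the sum of the logits cannot change.

```python
import numpy as np

def softmax(theta):
    # Subtract the max for numerical stability; the result is unchanged.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run_policy_gradient(means, eta, n, seed=0):
    """Discrete-time softmax policy gradient (illustrative assumption,
    not the paper's continuous-time diffusion)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)  # assumed initialisation: logits start at zero
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)            # sample an arm from the policy
        reward = means[a] + rng.normal()   # assumed unit-variance Gaussian noise
        grad = -pi                         # (1{A = a} - pi_a) for every arm a
        grad[a] += 1.0
        theta += eta * reward * grad       # increments sum to zero across arms
    return theta

theta = run_policy_gradient(np.array([0.5, 0.3, 0.1]), eta=0.05, n=2000)
print(abs(theta.sum()))  # zero up to floating-point rounding
```

The check holds for any reward sequence, since it depends only on the form of the update and not on the arm means or the noise.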

Figures (2)

  • Figure 1: The plot shows 40 trajectories of $\pi_{t,1}$ produced by the policy gradient algorithm on the instance from the lower-bound theorem, for 6 different learning rates with $\Delta_{2} = 0.002$.
  • Figure 2: The figure shows the results for the same experiment as in Figure 1 but with $k = 3$, suggesting that the logarithmic number of arms used in the lower bound is needed.

Theorems & Definitions (24)

  • Lemma 1: conservation
  • proof
  • Lemma 2
  • Lemma 3
  • Proposition 4
  • proof
  • Remark 5
  • Theorem 6
  • Lemma 7
  • Lemma 8
  • ...and 14 more