
A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

Tor Lattimore

Abstract

We adapt the analysis of policy gradient for continuous time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete time setup. As in continuous time, we prove that with learning rate $η= O(Δ_{\min}^2/(Δ_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / η)$ where $n$ is the horizon and $Δ_{\min}$ and $Δ_{\max}$ are the minimum and maximum gaps.


Paper Structure

This paper contains 10 sections, 4 theorems, 14 equations, and 1 algorithm.

Key Result

Theorem 1

If $\eta \leq \frac{\Delta_{\min}^2}{120 \Delta_{\max} \log(nk)}$, then the regret of Algorithm 1 satisfies $\mathbb E[\operatorname{Reg}_n] = O\left(\frac{k \log(n) \log(k)}{\eta}\right)$.
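The paper's algorithm is not reproduced on this page, but the object of study is standard: softmax policy gradient (REINFORCE) on a $k$-armed stochastic bandit, with a fixed learning rate $\eta$ chosen small relative to the gaps. The sketch below is an illustrative implementation under the assumption of Bernoulli rewards; the function name `run_pg_bandit` and all parameter choices are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over a logit vector."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def run_pg_bandit(means, n, eta, seed=0):
    """Softmax policy gradient on a k-armed Bernoulli bandit.

    means : true success probabilities of the arms (assumed, for illustration)
    n     : horizon
    eta   : fixed learning rate (Theorem 1 asks for
            eta <= Delta_min^2 / (120 * Delta_max * log(n*k)))
    Returns the final policy and the cumulative (pseudo-)regret.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)          # logits; the policy is pi = softmax(theta)
    best = max(means)
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)                # sample an arm from the policy
        r = float(rng.random() < means[a])     # Bernoulli reward
        # REINFORCE estimate of the gradient of expected reward w.r.t. logits:
        # grad_theta log pi(a) * r = r * (e_a - pi)
        grad = -r * pi
        grad[a] += r
        theta += eta * grad
        regret += best - means[a]
    return softmax(theta), regret
```

With a small enough `eta`, the policy concentrates on the best arm while the theorem's $O(k \log(k)\log(n)/\eta)$ bound controls the price of that slow, conservative learning rate.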

Theorems & Definitions (7)

  • Theorem 1
  • Definition 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Proof of Theorem 1
  • Remark 6