
A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

Tor Lattimore

Abstract

We adapt the analysis of policy gradient for continuous time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete time setup. As in continuous time, we prove that with learning rate $η= O(Δ_{\min}^2/(Δ_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / η)$ where $n$ is the horizon and $Δ_{\min}$ and $Δ_{\max}$ are the minimum and maximum gaps.


Paper Structure

This paper contains 10 sections, 4 theorems, 14 equations, and 1 algorithm.

Key Result

Theorem 1

If $\eta \leq \frac{\Delta_{\min}^2}{120 \Delta_{\max} \log(nk)}$, then the regret of Algorithm 1 satisfies $\mathbb E[\operatorname{Reg}_n] = O\left(\frac{k \log(n) \log(k)}{\eta}\right)$.
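The paper's algorithm is not reproduced on this page, but the object of study is standard: softmax policy gradient (REINFORCE) on a $k$-armed stochastic bandit, with a fixed learning rate $\eta$ chosen small relative to the gaps. The sketch below is an illustrative implementation under the assumption of Bernoulli rewards; the function name `run_pg_bandit` and all parameter choices are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over a logit vector."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def run_pg_bandit(means, n, eta, seed=0):
    """Softmax policy gradient on a k-armed Bernoulli bandit.

    means : true success probabilities of the arms (assumed, for illustration)
    n     : horizon
    eta   : fixed learning rate (Theorem 1 asks for
            eta <= Delta_min^2 / (120 * Delta_max * log(n*k)))
    Returns the final policy and the cumulative (pseudo-)regret.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)          # logits; the policy is pi = softmax(theta)
    best = max(means)
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)                # sample an arm from the policy
        r = float(rng.random() < means[a])     # Bernoulli reward
        # REINFORCE estimate of the gradient of expected reward w.r.t. logits:
        # grad_theta log pi(a) * r = r * (e_a - pi)
        grad = -r * pi
        grad[a] += r
        theta += eta * grad
        regret += best - means[a]
    return softmax(theta), regret
```

With a small enough `eta`, the policy concentrates on the best arm while the theorem's $O(k \log(k)\log(n)/\eta)$ bound controls the price of that slow, conservative learning rate.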

Theorems & Definitions (7)

  • Theorem 1
  • Definition 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Proof of Theorem 1
  • Remark 6