Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods

Sara Klein, Simon Weissmann, Leif Döring

TL;DR

This work addresses global convergence for policy gradient methods in finite-time horizon MDPs where optimal policies are non-stationary. It introduces dynamic policy gradient (DPG), training epochs backwards in time to exploit dynamic programming structure, and analyzes both simultaneous and dynamic softmax PG under exact and stochastic gradients. For tabular softmax parametrisation, the authors establish β-smoothness and weak Polyak-Łojasiewicz inequalities, deriving convergence rates: $O(1/n)$ in the exact-gradient setting with dynamic PG improving horizon-dependence from $H^5$ to $H^3$, and high-probability complexity bounds in the stochastic setting, where dynamic PG again offers explicit constants and gains in horizon scaling. The results suggest substantial practical benefits of training policies backwards in time, with potential extensions to lower-dimensional function classes and variance-reduction techniques for tighter, more realistic bounds.

Abstract

Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. Finite-time horizon problems are relevant, for instance, for optimal stopping or specific supply chain problems, but also in the training of large language models. In contrast to infinite horizon MDPs, optimal policies are not stationary, so policies must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time. For the tabular softmax parametrisation we carry out the convergence analysis for simultaneous and dynamic policy gradient towards global optima, both in the exact and sampled gradient settings without regularisation. It turns out that dynamic policy gradient training exploits the structure of finite-time problems much better, which is reflected in improved convergence bounds.
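
The backward-in-time training idea can be illustrated with a minimal sketch. It assumes a small tabular MDP with known transitions and rewards, so gradients are exact. The function names (`dynamic_pg`, `optimal_values`), the step size `eta`, and the number of steps per epoch are hypothetical choices for illustration, not the constants or the exact algorithm analysed in the paper.

```python
import numpy as np

def softmax(theta):
    # Row-wise softmax over actions, shifted for numerical stability.
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dynamic_pg(P, r, H, mu, eta=0.2, steps=5000):
    """Sketch of backward-in-time softmax policy gradient (exact gradients).

    P:  (S, A, S) transition probabilities, r: (S, A) rewards,
    mu: (S,) state distribution used to weight each epoch's objective.
    eta and steps are illustrative assumptions, not the paper's values.
    """
    S, A = r.shape
    theta = np.zeros((H, S, A))        # one parameter table per epoch
    V_next = np.zeros(S)               # value after the final epoch is zero
    for h in reversed(range(H)):       # train the last epoch first
        # Q-values at epoch h, given the already trained later epochs.
        Q = r + P @ V_next
        for _ in range(steps):
            pi = softmax(theta[h])
            adv = Q - (pi * Q).sum(axis=1, keepdims=True)
            # Exact softmax gradient of sum_s mu(s) sum_a pi(a|s) Q(s,a).
            theta[h] += eta * mu[:, None] * pi * adv
        pi = softmax(theta[h])
        V_next = (pi * Q).sum(axis=1)  # value of the trained policy from epoch h
    return theta, V_next

def optimal_values(P, r, H):
    # Backward induction (dynamic programming) for the optimal epoch-0 values.
    V = np.zeros(r.shape[0])
    for _ in range(H):
        V = (r + P @ V).max(axis=1)
    return V

# Hypothetical random MDP for demonstration only.
rng = np.random.default_rng(0)
S, A, H = 3, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
mu = np.full(S, 1.0 / S)
theta, V0 = dynamic_pg(P, r, H, mu)
print("largest gap to optimal value:", np.max(optimal_values(P, r, H) - V0))
```

Each epoch in this sketch reduces to a contextual-bandit-like problem, because the later epochs are frozen once trained. Simultaneous training would instead update all `H` parameter tables in every step.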

Paper Structure (20 sections, 46 theorems, 261 equations, 1 figure, 2 algorithms)


Key Result

Theorem 3.1

Under Assumption ass:sim, let $\mu$ be a probability measure such that $\mu(s) >0$ for all $s\in\mathcal{S}$, let $\eta = \frac{1}{5H^2 R^\ast}$ and consider the sequence $(\theta^{(n)})$ generated by Algorithm alg:simultaneous-PG with arbitrary initialisation $\theta^{(0)}$. For $\epsilon>0$ choose

Figures (1)

  • Figure 1: (a) shows the behavior of $V_0^{\pi^{\theta^{(n)}}}$ during the training steps over all epochs. (b) shows the log-log plot of the same simulation visualizing the convergence rate towards $V_0^\ast$.

Theorems & Definitions (93)

  • Theorem 3.1
  • Proposition 3.1
  • Lemma 3.2
  • Theorem 3.2
  • Theorem 4.0
  • Theorem 4.0
  • Remark A.1
  • Remark A.2
  • Lemma A.3: Performance difference lemma
  • proof
  • ...and 83 more