Structure Matters: Dynamic Policy Gradient

Sara Klein, Xiangyuan Zhang, Tamer Başar, Simon Weissmann, Leif Döring

TL;DR

It is proved that softmax DynPG scales polynomially in the effective horizon $(1-\gamma)^{-1}$, contrasting recent exponential lower bound examples for vanilla policy gradient.

Abstract

In this work, we study $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs) and introduce a framework called dynamic policy gradient (DynPG). The framework directly integrates dynamic programming with (any) policy gradient method, explicitly leveraging the Markovian property of the environment. DynPG dynamically adjusts the problem horizon during training, decomposing the original infinite-horizon MDP into a sequence of contextual bandit problems. By iteratively solving these contextual bandits, DynPG converges to the stationary optimal policy of the infinite-horizon MDP. To demonstrate the power of DynPG, we establish its non-asymptotic global convergence rate under the tabular softmax parametrization, focusing on the dependencies on salient but essential parameters of the MDP. By combining classical arguments from dynamic programming with more recent convergence arguments of policy gradient schemes, we prove that softmax DynPG scales polynomially in the effective horizon $(1-\gamma)^{-1}$. Our findings contrast recent exponential lower bound examples for vanilla policy gradient.
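
The horizon-growing idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm listing: it assumes exact gradients, a known transition kernel `P` and reward table `r`, zero-initialized softmax parameters in every epoch, and a constant step size. The names `dynpg`, `epochs`, `pg_steps`, and `lr` are hypothetical. In each epoch the policy for the first step is trained as a contextual bandit whose reward for action $a$ in context $s$ is $r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V(s')$, where $V$ is the value of the policies trained in earlier epochs.

```python
import numpy as np


def softmax(theta):
    """Row-wise softmax over actions, shifted for numerical stability."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def dynpg(P, r, gamma, mu, epochs, pg_steps, lr):
    """Illustrative DynPG-style loop with exact gradients (an assumption).

    P:  (S, A, S) transition probabilities
    r:  (S, A) rewards
    mu: (S,) start-state distribution used to weight the bandit objective
    Returns the policy trained in the last epoch and the finite-horizon
    value of the sequence of trained policies.
    """
    S, A = r.shape
    V = np.zeros(S)  # value of the zero-step problem
    pi = np.full((S, A), 1.0 / A)
    for _ in range(epochs):
        # Bandit reward: immediate reward plus discounted value of
        # following the previously trained policies afterwards.
        Q = r + gamma * (P @ V)  # shape (S, A)
        theta = np.zeros((S, A))  # fresh parameters each epoch (assumption)
        for _ in range(pg_steps):
            pi = softmax(theta)
            baseline = (pi * Q).sum(axis=1, keepdims=True)
            # Exact gradient of sum_s mu(s) sum_a pi(a|s) Q(s, a)
            # with respect to the softmax parameters.
            theta += lr * mu[:, None] * pi * (Q - baseline)
        pi = softmax(theta)
        # Extend the horizon by one step: apply the new policy first,
        # then the earlier ones (summarized by the old V).
        V = (pi * Q).sum(axis=1)
    return pi, V
```

Because earlier policies are held fixed, each inner loop optimizes only a one-step problem; the sketch summarizes the stored policies through their value vector `V` rather than keeping them for rollouts. A sample-based variant would replace the exact gradient with a stochastic estimate.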

Paper Structure

This paper contains 38 sections, 24 theorems, 126 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Proposition 4.1

The overall error of DynPG after $H$ iterations can be decomposed as follows

Figures (4)

  • Figure 1: DynPG solves a sequence of contextual bandit problems, iteratively storing the convergent policies to memory and applying them accordingly as fixed policies in later iterations.
  • Figure 2: Success probability of achieving the sub-optimality gap of $\epsilon = 0.01$ in the overall error.
  • Figure 3: Visualization of the MDP state transitions.
  • Figure 4: Success probability of achieving the sub-optimality gap of $\epsilon=0.01$ in the overall error.

Theorems & Definitions (51)

  • Proposition 4.1
  • Corollary 4.3
  • Remark 4.4
  • Theorem 4.6
  • Remark 4.7
  • Theorem 4.8
  • Remark 4.9
  • Theorem 4.10
  • Theorem 4.11
  • Remark 4.12
  • ...and 41 more