Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning

Yen-Ju Chen, Nai-Chieh Huang, Ching-Pei Lee, Ping-Chun Hsieh

TL;DR

This paper introduces Accelerated Policy Gradient (APG), which applies Nesterov momentum to policy gradient optimization in reinforcement learning. It proves that, under tabular softmax parameterization, APG achieves a $\tilde{O}(1/t^2)$ convergence rate with constant step sizes and a linear rate with exponentially growing step sizes, by leveraging a local nearly-concave structure of the RL objective and an absorbing momentum regime. The authors establish almost-sure convergence to the optimal policy for APG, characterize feasible update directions, and show APG attains improved rates over standard PG in general MDPs. Empirical results on 5-state MDPs and Atari 2600 benchmarks corroborate the theoretical gains, demonstrating faster convergence than standard PG and heavy-ball policy gradient (HBPG). The work also discusses lower bounds for standard PG and outlines future directions, including stochastic gradient extensions and regularization-induced acceleration.

Abstract

Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method on PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed Accelerated Policy Gradient (APG). To demonstrate the potential of APG in achieving fast convergence, we formally prove that with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially-growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on one interesting finding: Regardless of the parameter initialization, APG ends up entering a locally nearly-concave regime, where APG can significantly benefit from the momentum, within finite iterations. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits a $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially-growing step sizes, significantly improving convergence over the standard PG.
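
To make the idea in the abstract concrete, the following is a minimal sketch of Nesterov-style momentum applied to exact softmax policy gradient on a two-armed bandit. It is not the authors' algorithm listing: the reward vector, the momentum coefficient $(t-1)/(t+2)$ (the standard NAG schedule), the monotone safeguard that falls back to the plain gradient iterate, and the treatment of the bandit as the $\gamma = 0$ case are all assumptions made for illustration. The function names are hypothetical.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def value(theta, r):
    """Expected reward of the softmax policy on a bandit with rewards r."""
    return float(softmax(theta) @ r)

def grad(theta, r):
    """Exact gradient of the bandit value with respect to theta."""
    pi = softmax(theta)
    return pi * (r - pi @ r)

def step_size(t, gamma=0.0):
    """Schedule eta^(t) = t/(t+1) * (1-gamma)^3/16 (gamma = 0 is an assumption here)."""
    return (t / (t + 1)) * (1.0 - gamma) ** 3 / 16.0

def run_pg(r, theta0, steps):
    """Plain policy gradient ascent with the same step-size schedule."""
    theta = theta0.copy()
    for t in range(1, steps + 1):
        theta = theta + step_size(t) * grad(theta, r)
    return theta

def run_apg(r, theta0, steps):
    """Nesterov-style accelerated policy gradient (illustrative variant)."""
    theta = theta0.copy()   # gradient iterate
    omega = theta0.copy()   # look-ahead iterate where the gradient is evaluated
    for t in range(1, steps + 1):
        # Gradient step taken from the look-ahead point.
        theta_next = omega + step_size(t) * grad(omega, r)
        # Momentum extrapolation (assumed standard NAG coefficient).
        phi = theta_next + (t - 1) / (t + 2) * (theta_next - theta)
        # Assumed safeguard: keep the extrapolated point only if it does not hurt.
        omega = phi if value(phi, r) >= value(theta_next, r) else theta_next
        theta = theta_next
    return theta

if __name__ == "__main__":
    r = np.array([1.0, 0.0])      # hypothetical rewards; action 0 is optimal
    theta0 = np.zeros(2)          # uniform initial policy
    for name, run in [("PG", run_pg), ("APG", run_apg)]:
        gap = r.max() - value(run(r, theta0, 500), r)
        print(f"{name} sub-optimality gap after 500 steps: {gap:.2e}")
```

Both runs use the same step-size schedule, so any difference in the final sub-optimality gap comes from the momentum term alone. This toy setting only illustrates the mechanism and does not reproduce the paper's MDP or Atari experiments.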

Paper Structure

This paper contains 50 sections, 54 theorems, 247 equations, 7 figures, 10 tables, 7 algorithms.

Key Result

Theorem 1

Consider a tabular softmax parameterized policy $\pi_{\theta}$. For APG with $\eta^{(t)} = \frac{t}{t+1} \cdot \frac{(1 - \gamma)^3}{16}$ and $\mu$ initialized uniformly at random, the following holds almost surely:

Figures (7)

  • Figure 1: The value function $V(s)$ versus the policy parameter $\theta_{a^*}$ and $\theta_{a_2}$ under a 2-armed bandit problem.
  • Figure 2: A comparison between the performance of APG, PG, and HBPG under an MDP with 5 states, 5 actions, with the uniform and hard policy initialization: (a)-(b) show the sub-optimality gaps under the uniform and the hard initialization, respectively; (c)-(d) show the one-step improvements of APG from the momentum (i.e., $V^{\pi_{\omega}^{(t)}}(\rho)-V^{\pi_{\theta}^{(t)}}(\rho)$) and the gradient (i.e., $V^{\pi_{\theta}^{(t+1)}}(\rho)-V^{\pi_{\omega}^{(t)}}(\rho)$), under the uniform and the hard initialization, respectively.
  • Figure 3: A comparison of the performance of APG and the benchmark algorithms in four Atari 2600 games. All the results are averaged over 5 random seeds (with the shaded area showing the range of $\text{mean} \pm \text{std}$).
  • Figure 4: The sub-optimality gap of APG with time-varying step sizes under an MDP of five states and five actions with uniform initialization.
  • Figure 5: The one-step improvement of APG on a three-action bandit problem.
  • ...and 2 more figures
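
The two quantities named in the Figure 2 caption split each APG iteration into a momentum part and a gradient part. As a sketch, assuming $\omega^{(t)}$ denotes the intermediate parameter reached after the momentum step and before the next gradient step (which is how the caption reads), the two parts telescope to the total change in value over one iteration:

$$
V^{\pi_{\theta}^{(t+1)}}(\rho) - V^{\pi_{\theta}^{(t)}}(\rho)
= \underbrace{V^{\pi_{\omega}^{(t)}}(\rho) - V^{\pi_{\theta}^{(t)}}(\rho)}_{\text{improvement from momentum}}
+ \underbrace{V^{\pi_{\theta}^{(t+1)}}(\rho) - V^{\pi_{\omega}^{(t)}}(\rho)}_{\text{improvement from gradient}}
$$

Under this reading, panels (c)-(d) show how much of each iteration's progress is attributable to each of the two terms.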

Theorems & Definitions (119)

  • Remark 1: Step-Size Regimes
  • Remark 2
  • Theorem 1: Asymptotic Convergence Under Softmax Parameterization
  • Remark 3
  • Remark 4
  • Definition 1: $C$-Near Concavity
  • Definition 2
  • Lemma 1: Locally $C$-Near Concavity; Informal
  • Remark 5
  • Remark 6
  • ...and 109 more