Fast Stochastic Policy Gradient: Negative Momentum for Reinforcement Learning

Haobin Zhang; Zhuang Yang

TL;DR

The paper addresses slow convergence in stochastic policy gradient methods for reinforcement learning and introduces SPG-NM, a fast SPG algorithm that injects negative momentum into updates while maintaining complexity comparable to accelerated gradient methods. By leveraging a momentum term and a selective update rule that compares value estimates, SPG-NM accelerates convergence and demonstrates robustness to the momentum parameter $\lambda$ across bandit and MDP tasks. Empirical results show SPG-NM outperforms PG, PG-HB, APG, and PG-Adam in many settings, with faster convergence and smaller sub-optimality gaps, though very large $\lambda$ can induce oscillations in harder problems. These findings support negative momentum as a practical acceleration mechanism for RL policy optimization, with potential for broader applications and improved learning-rate strategies in future work.

Abstract

Stochastic optimization algorithms, particularly stochastic policy gradient (SPG), have achieved significant success in reinforcement learning (RL). Nevertheless, how to quickly reach an optimal solution in RL remains a challenge. To tackle this issue, this work develops a fast SPG algorithm based on momentum, coined SPG-NM. Specifically, SPG-NM applies a novel type of negative momentum (NM) technique to the classical SPG algorithm. Unlike existing NM techniques, our SPG-NM algorithm adopts only a few hyper-parameters. Moreover, its computational complexity is nearly the same as that of modern SPG-type algorithms, e.g., accelerated policy gradient (APG), which equips SPG with Nesterov's accelerated gradient (NAG). We evaluate the resulting algorithm on two classical tasks, the bandit setting and the Markov decision process (MDP). Numerical results on both tasks show that it converges faster than state-of-the-art algorithms, which confirms the positive impact of NM in accelerating SPG for RL. In addition, numerical experiments under different settings confirm that SPG-NM is robust to certain crucial hyper-parameters, which relieves the user of careful tuning in practice.
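
As a rough illustration of the bandit setting and the sub-optimality gap mentioned in this listing, the sketch below runs exact softmax policy gradient on a small bandit and records the gap after each step. The listing does not state the SPG-NM update rule, so the momentum term here is a generic heavy-ball form whose coefficient can be set negative; it is a stand-in, not the authors' algorithm, and it omits their selective update rule. The names `run`, `lr`, `beta`, and the reward vector are hypothetical choices made for this example.

```python
import math

def softmax(theta):
    # Numerically stable softmax over a list of logits.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, rewards):
    # Value of the softmax policy on a bandit: sum_a pi(a) * r(a).
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def policy_gradient(theta, rewards):
    # Exact gradient of the expected reward w.r.t. the logits:
    # d/d theta_a = pi(a) * (r(a) - expected reward).
    pi = softmax(theta)
    v = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - v) for p, r in zip(pi, rewards)]

def run(rewards, steps=2000, lr=0.4, beta=0.0):
    # beta is an assumed heavy-ball coefficient; beta < 0 gives a generic
    # negative-momentum step, NOT the SPG-NM update from the paper.
    theta = [0.0] * len(rewards)  # uniform policy initialization
    prev = list(theta)
    gaps = []
    for _ in range(steps):
        grad = policy_gradient(theta, rewards)
        new = [t + lr * g + beta * (t - p)
               for t, g, p in zip(theta, grad, prev)]
        prev, theta = theta, new
        # Sub-optimality gap: best achievable reward minus current value.
        gaps.append(max(rewards) - expected_reward(theta, rewards))
    return theta, gaps

if __name__ == "__main__":
    for beta in (0.0, -0.2):
        _, gaps = run([1.0, 0.8, 0.1], beta=beta)
        print(f"beta={beta}: final sub-optimality gap = {gaps[-1]:.6f}")
```

Plotting `gaps` against the iteration count for several values of `beta` gives the kind of sub-optimality curve the figure list refers to, although the numbers from this toy problem say nothing about the paper's reported results.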

Paper Structure (13 sections, 3 equations, 7 figures, 3 algorithms)

Introduction
Related Work
Policy Gradient
Accelerated Gradient
Method
Markov Decision Process
SPG with Negative Momentum
Experiment
Bandit
MDP
Choice of the hyper-parameter $\lambda$
Sub-Optimality Gap
Conclusion

Figures (7)

Figure 1: Comparison of five algorithms in the bandit setting with uniform policy initialization; panels (a)-(e) show their value functions.
Figure 2: Comparison of five algorithms in the bandit setting with hard policy initialization; panels (a)-(e) show their value functions.
Figure 3: Comparison of five algorithms on an MDP with 5 states and 5 actions under uniform policy initialization; panels (a)-(e) show their per-state value functions.
Figure 4: Comparison of five algorithms on an MDP with 5 states and 5 actions under hard policy initialization; panels (a)-(e) show their per-state value functions.
Figure 5: Comparison of different values of $\lambda$ on an MDP with 5 states and 5 actions under uniform policy initialization.
...and 2 more figures
