Enhancing Policy Gradient with the Polyak Step-Size Adaption

Yunxiang Li, Rui Yuan, Chen Fan, Mark Schmidt, Samuel Horváth, Robert M. Gower, Martin Takáč

TL;DR

This work tackles the sensitivity of policy-gradient methods to the choice of learning rate by introducing a Polyak step-size variant tailored to RL. The approach removes the need for problem-specific constants through a stochastic, SPS$_\text{max}$-like update and handles the unknown optimal value $V^*$ with a twin-model method, complemented by an entropy-regularized loss that prevents explosive updates. The proposed algorithm combines twin-model-based $V^*$ estimation with GPOMDP gradient estimation under an adaptive step-size, and experiments show faster convergence and more stable policies than Adam on standard control tasks. Overall, the method provides a practical, hyper-parameter-free adaptive learning-rate mechanism that improves sample efficiency and stability in policy-gradient RL.
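
For orientation, the step-size rules this summary refers to can be written out as follows. This is the standard form of the Polyak step-size and of SPS$_\text{max}$ from the optimization literature, not a transcription of the paper's own equations; the symbols $\theta_t$, $f_i$, $c$, and $\gamma_b$ are notation assumed here. The last line is one plausible reading of the RL case, where the value $V$ is maximized rather than a loss minimized.

```latex
% Classical Polyak step-size for minimizing f with known optimum f^*
\gamma_t = \frac{f(\theta_t) - f^*}{\|\nabla f(\theta_t)\|^2}

% Stochastic variant with an upper bound (SPS_max); c > 0 and \gamma_b > 0 are constants
\gamma_t = \min\left\{ \frac{f_i(\theta_t) - f_i^*}{c\,\|\nabla f_i(\theta_t)\|^2},\ \gamma_b \right\}

% Assumed analogue for policy gradient (gradient ascent on V), with V^* unknown in practice
\gamma_t = \frac{V^* - V(\theta_t)}{\|\nabla_\theta V(\theta_t)\|^2}
```

The numerator is why an estimate of the optimal value is needed: without $f^*$ (or $V^*$), the step-size cannot be computed.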

Abstract

Policy gradient is a widely used and foundational algorithm in reinforcement learning (RL). Although it is renowned for its convergence guarantees and stability compared to other RL algorithms, its practical application is often hindered by sensitivity to hyper-parameters, particularly the step-size. In this paper, we integrate the Polyak step-size into RL, which automatically adjusts the step-size without prior knowledge. To adapt this method to RL settings, we address several issues, including the unknown $f^*$ in the Polyak step-size. We also demonstrate the performance of the Polyak step-size in RL through experiments, which show faster convergence and more stable policies.
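
To make the idea of a step-size that "automatically adjusts" concrete, here is a minimal sketch of the classical Polyak step-size on a toy least-squares problem. It is not the paper's algorithm: the problem, the function names `polyak_step` and `run_polyak`, and the parameters `c`, `gamma_max`, and `eps` are all assumptions made for illustration. The toy problem is built so that the optimal loss is exactly zero, which sidesteps the unknown-$f^*$ issue that the paper addresses with its twin-model method.

```python
import numpy as np


def polyak_step(loss, grad, f_star=0.0, c=1.0, gamma_max=None, eps=1e-12):
    """Polyak step-size (loss - f_star) / (c * ||grad||^2), optionally capped.

    The cap gamma_max mimics an SPS_max-style upper bound; eps only guards
    against division by zero. All parameter names are illustrative.
    """
    g2 = float(np.dot(grad, grad))
    gamma = (loss - f_star) / (c * g2 + eps)
    if gamma_max is not None:
        gamma = min(gamma, gamma_max)
    return max(gamma, 0.0)  # never take a negative step


def run_polyak(A, b, steps=500):
    """Gradient descent on 0.5 * ||A x - b||^2 with the Polyak step-size."""
    x = np.zeros(A.shape[1])
    losses = []
    for _ in range(steps):
        residual = A @ x - b
        loss = 0.5 * float(residual @ residual)
        grad = A.T @ residual
        losses.append(loss)
        x = x - polyak_step(loss, grad) * grad
    return x, losses


# Toy problem (an assumption, not from the paper): b is generated from a
# known x_true, so the system is consistent and the optimal loss is 0.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
x_true = rng.normal(size=5)
b = A @ x_true

x_hat, losses = run_polyak(A, b)
print(f"initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
```

No learning rate is tuned here: the step shrinks on its own as the loss approaches its optimum, which matches the behavior the Figure 5 caption reports for CartPole. In RL the optimum is not known in advance, so the numerator has to be estimated.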

Paper Structure

This paper contains 16 sections, 15 equations, 17 figures, 1 algorithm.

Figures (17)

  • Figure 1: The performance of Adam and SGD with various step-sizes on the CartPole environment (Brockman et al., 2016). The evaluation is averaged across three random seeds, distinct from the training seeds. Because the evaluation rewards oscillate, they are reported as a moving average.
  • Figure 2: A simplified two-step deterministic environment with a three-parameter policy. $x$, $y$, and $z$ represent the parameters of the policy. The sigmoid function is denoted by $\sigma$ with $\sigma(u) = \frac{1}{1+e^{-u}}$. Selecting a non-optimal trajectory at the first iteration leads to an increase in the probability of such a trajectory. Consequently, the update with the Polyak step-size is likely to approach infinity.
  • Figure 3: The performance of the twin-model method on convex and non-convex problems, compared with SGD with various step-sizes and SPS (SPS$_\text{max}$).
  • Figure 4: Comparison of the performance of Adam and the Polyak step-size. For each method, the best-performing hyper-parameter setting is shown. The Polyak step-size converges faster and is more stable. The line is the average over three random seeds, and the shaded region shows the min and max.
  • Figure 5: The step-size produced by the Polyak step-size with the twin-model method in CartPole. The step-size decreases as the policy approaches the optimal policy.
  • ...and 12 more figures