Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Tao Wang, Ruipeng Zhang, Sicun Gao

TL;DR

Simply increasing the number of value update steps per iteration allows vanilla policy gradient to match or outperform PPO in all the standard continuous control benchmark environments.

Abstract

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
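As a rough illustration of the knob the abstract refers to, the sketch below fits a value function to reward-to-go targets using a configurable number of gradient steps per iteration. It is not the authors' implementation: the linear value function, the toy rollout, the full-batch gradient steps, the learning rate, and all function names (`discounted_returns`, `fit_value`, `value_mse`) are assumptions made for illustration, and the policy update is omitted entirely.

```python
def discounted_returns(rewards, gamma):
    """Reward-to-go targets G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def predict(weights, x):
    """Linear value estimate V(s) = w . phi(s) (assumed model, for illustration)."""
    return sum(w * xi for w, xi in zip(weights, x))


def value_mse(weights, features, targets):
    """Mean squared error between value predictions and return targets."""
    return sum((predict(weights, x) - g) ** 2
               for x, g in zip(features, targets)) / len(targets)


def fit_value(weights, features, targets, num_value_steps, lr=0.1):
    """Run `num_value_steps` gradient steps on the MSE loss.

    `num_value_steps` plays the role of the number of value update steps
    per iteration; 1 corresponds to a single value step per iteration.
    """
    w = list(weights)
    n = len(targets)
    for _ in range(num_value_steps):
        grad = [0.0] * len(w)
        for x, g in zip(features, targets):
            err = predict(w, x) - g
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy rollout: constant reward, features (1, normalized time step).
horizon, gamma = 20, 0.99
rewards = [1.0] * horizon
features = [[1.0, t / horizon] for t in range(horizon)]
targets = discounted_returns(rewards, gamma)

# Value error after 1, 10, and 50 value steps from the same initial weights.
errors = {
    k: value_mse(fit_value([0.0, 0.0], features, targets, k), features, targets)
    for k in (1, 10, 50)
}
for k, err in errors.items():
    print(f"value steps = {k:2d}  MSE = {err:.3f}")
```

On this toy problem the value error shrinks as the number of value steps grows, which is the mechanism the abstract identifies as critical. In a full training loop, the fitted values would then be used to form the advantage estimates for the policy gradient step.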

Paper Structure

This paper contains 30 sections, 2 theorems, 26 equations, 12 figures, 4 tables.

Key Result

Theorem 5.2

Assume that the dynamics, reward function, policy and value networks are all Lipschitz continuous with respect to their input variables. Let $\beta_1, \beta_2$ denote the learning rate for policy and value network, respectively, and $K_V$ denote the number of value steps per epoch. Then for each pol steps made to the value network when $\beta_1, \beta_2$ are small. $\alpha = \frac{-\log \gamma}{\l

Figures (12)

  • Figure 1: Policy objectives in many continuous-control environments are highly non-smooth and fractal.
  • Figure 2: We compare the performance of different implementations on the MuJoCo Hopper task. The clipping parameter is set to $\epsilon = 0.2$ as default.
  • Figure 3: The cumulative reward and value estimation error during PPO training in the Hopper task are compared between full-batch and mini-batch updates. It highlights how the use of full-batch updates leads to suboptimal policy performance, as reflected in the large value estimation errors, while mini-batch updates facilitate more accurate value estimation and better cumulative reward outcomes.
  • Figure 4: Training curves on Gymnasium benchmarks. The curve VPG-repeat-k corresponds to the vanilla policy gradient algorithm with $k$ value steps applied each iteration. For example, VPG-repeat-1 represents the original vanilla policy gradient implementation. As the number of value steps increases, the performance of vanilla policy gradient consistently improves, eventually converging to or outperforming PPO when the number of value steps reaches 50 or more.
  • Figure 5: The corresponding value estimation difference in the experiments shown in Figure \ref{fig:main}. We disabled exploration during evaluation, using the deterministic policy as the direct output of the policy network. The difference in value estimation is computed through Equation \ref{eq:vale}. We observe a clear correlation between the value estimation difference in VPG and its performance. As the value steps increase, the estimation error decreases and eventually oscillates around zero, leading to improved performance. More results can be found in Figure \ref{fig:original_value}.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Definition 5.1
  • Theorem 5.2
  • Definition 3.1
  • Proposition 3.2