A Closer Look at Deep Policy Gradients

Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry

TL;DR

The paper questions whether deep policy gradient practice (notably PPO and TRPO) faithfully reflects its theoretical underpinnings. Through a fine-grained empirical analysis of gradient estimation, value prediction, and optimization landscapes, it reveals that gradient estimates are noisy and poorly correlated with the true gradient, that value networks fit supervised targets but not the true value function, and that the surrogate objective landscape can mispredict true reward behavior. These findings highlight a gap between theory and practice, suggesting that improving deep RL requires a multi-faceted understanding of its primitives rather than relying on benchmark performance alone. The work calls for refined theory and evaluation methods to yield more reliable and robust deep RL algorithms.

Abstract

We study how the behavior of deep policy gradient algorithms reflects the conceptual framework motivating their development. To this end, we propose a fine-grained analysis of state-of-the-art methods based on key elements of this framework: gradient estimation, value prediction, and optimization landscapes. Our results show that the behavior of deep policy gradient algorithms often deviates from what their motivating framework would predict: the surrogate objective does not match the true reward landscape, learned value estimators fail to fit the true value function, and gradient estimates poorly correlate with the "true" gradient. The mismatch between predicted and empirical behavior we uncover highlights our poor understanding of current methods, and indicates the need to move beyond current benchmark-centric evaluation methods.

Paper Structure

This paper contains 28 sections, 19 equations, 20 figures, 1 table.

Figures (20)

  • Figure 1: Empirical variance of the estimated gradient (cf. \ref{eqn:grad_sr}) as a function of the number of state-action pairs used in estimation in the MuJoCo Humanoid task. We measure the average pairwise cosine similarity between ten repeated gradient measurements taken from the same policy, with the $95\%$ confidence intervals (shaded). For each algorithm, we perform multiple trials with the same hyperparameter configurations but different random seeds, shown as repeated lines in the figure. The vertical line (at $x = 2$K) indicates the sample regime used for gradient estimation in standard implementations of policy gradient methods. In general, it seems that obtaining tightly concentrated gradient estimates would require significantly more samples than are used in practice, particularly after the first few timesteps. For other tasks -- such as Walker2d-v2 and Hopper-v2 -- the plots (seen in Appendix Figure \ref{fig:gradvar_app}) have similar trends, except that gradient variance is slightly lower. Confidence intervals calculated with 500-sample bootstrapping.
  • Figure 2: Convergence of gradient estimates (cf. \ref{eqn:grad_sr}) to the "true" expected gradient in the MuJoCo Humanoid task. We measure the mean cosine similarity between the "true" gradient, approximated using ten million state-action pairs, and ten gradient estimates which use increasing numbers of state-action pairs (with $95\%$ confidence intervals). For each algorithm, we perform multiple trials with the same hyperparameter configurations but different random seeds. The vertical line (at $x = 2$K) indicates the sample regime used for gradient estimation in standard implementations of policy gradient methods. Observe that although it is possible to empirically estimate the true gradient, this requires several-fold more samples than are commonly used in practical applications of these algorithms. Note also that the estimation task becomes more difficult further into training. For other tasks -- such as Walker2d-v2 and Hopper-v2 -- the plots (seen in Appendix Figure \ref{fig:truegrad_app}) have similar trends, except that gradient estimation is slightly better. Confidence intervals calculated with 500-sample bootstrapping.
  • Figure 3: Quality of value prediction in terms of mean relative error (MRE) on held-out state-action pairs for agents trained to solve the MuJoCo Walker2d-v2 task. We observe in (left) that the agents do indeed succeed at solving the supervised learning task they are trained for: the MRE on the GAE-based value loss $(V_{old} + A_{GAE})^2$ (cf. \ref{eq:val_targ}) is small. On the other hand, in (right) we see that the returns MRE is still quite high: the learned value function is off by about $50\%$ with respect to the underlying true value function. Similar plots for other MuJoCo tasks are in Appendix \ref{app:value_pred}.
  • Figure 4: Efficacy of the value network as a variance-reducing baseline for Walker2d-v2 (top) and Hopper-v2 (bottom) agents. We measure the empirical variance of the gradient (cf. \ref{eqn:grad_sr}) as a function of the number of state-action pairs used in estimation, for different choices of baseline functions: the value network (used by the agent in training), the "true" value function (fit to the returns using $5\cdot 10^6$ state-action pairs sampled from the current policy) and the "zero" value function (i.e. replacing advantages with returns). We observe that using the true value function leads to a significantly lower-variance estimate of the gradient compared to the value network. In turn, employing the value network yields a noticeable variance reduction compared to the zero baseline function, even though this difference may appear rather small in the small-sample regime ($2$K). Confidence intervals calculated with 10-sample bootstrapping.
  • Figure 5: True reward landscape concentration for TRPO on Humanoid-v2. We visualize the landscape at training iteration 150 while varying the number of trajectories used in reward estimation (each subplot), both in the direction of the step taken and in a random direction. Moving one unit along the "step direction" axis corresponds to moving one full step in parameter space. In the random direction, one unit corresponds to moving along a random Gaussian vector of norm $2$ in the parameter space. In practice, the norm of the step is typically an order of magnitude lower than that of the random direction. While the landscape is very noisy in the low-sample regime, large numbers of samples reveal a well-behaved underlying landscape. See Figures \ref{fig:ppo_landscape_concentration} and \ref{fig:trpo_landscape_concentration} of the Appendix for additional plots.
  • ...and 15 more figures
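
The captions for Figures 1 to 3 rely on three simple metrics: average pairwise cosine similarity among repeated gradient estimates, mean cosine similarity to a reference ("true") gradient, and mean relative error. The following is a minimal NumPy sketch of how such quantities could be computed, not the authors' code. All function names are hypothetical, the MRE formula (mean of |prediction - target| / |target|) is an assumed definition, and the percentile bootstrap is one common way to obtain confidence intervals; the captions do not say which bootstrap variant was used.

```python
import numpy as np


def mean_pairwise_cosine(grads: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows in `grads`.

    `grads` has shape (n_estimates, n_params); each row is one flattened
    gradient estimate taken from the same policy.
    """
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(grads), k=1)  # distinct pairs, no self-pairs
    return float(sims[upper].mean())


def mean_cosine_to_reference(grads: np.ndarray, reference: np.ndarray) -> float:
    """Mean cosine similarity between each row of `grads` and `reference`."""
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference)
    return float((unit @ ref).mean())


def mean_relative_error(pred, target) -> float:
    """Assumed definition: mean of |pred - target| / |target|."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target) / np.abs(target)))


def bootstrap_ci(values, n_boot: int = 500, level: float = 0.95, seed: int = 0):
    """Percentile bootstrap interval for the mean of a 1-D array (assumed variant)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement n_boot times and record each resample's mean.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)


if __name__ == "__main__":
    # Toy stand-in for gradient estimates: a fixed direction plus heavy noise.
    rng = np.random.default_rng(0)
    true_grad = rng.normal(size=1000)
    estimates = true_grad + 5.0 * rng.normal(size=(10, 1000))
    print("pairwise cosine:", mean_pairwise_cosine(estimates))
    print("cosine to reference:", mean_cosine_to_reference(estimates, true_grad))
    per_estimate = [mean_cosine_to_reference(e[None, :], true_grad) for e in estimates]
    print("95% CI:", bootstrap_ci(per_estimate))
```

In this toy setup the noise scale plays the role of the sample budget: shrinking it corresponds to using more state-action pairs per estimate, and both cosine metrics move toward 1.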