Identifying Policy Gradient Subspaces

Jan Schneider, Pierre Schumacher, Simon Guist, Le Chen, Daniel Häufle, Bernhard Schölkopf, Dieter Büchler

TL;DR

This paper investigates gradient subspaces in deep policy gradient RL and shows that, despite the nonstationary data distribution induced by exploration, the gradients of PPO and SAC lie largely in a low-dimensional, high-curvature subspace spanned by the top Hessian eigenvectors. Using Lanczos-based estimates of the Hessian eigenvectors computed on large batches of on-policy or replay-buffer data, the authors quantify the gradient subspace fraction $S_\mathrm{frac} = \frac{\|P g\|^2}{\|g\|^2}$, where $g$ is the gradient and $P$ projects onto the subspace, and show that a small number of directions dominate the curvature. The subspaces also remain relatively stable over training, as measured by the subspace overlap $S_\mathrm{overlap}$. The critic's subspace is generally more stable and captures a larger fraction of the gradient than the actor's, and SAC exhibits stronger subspace alignment than PPO, reflecting differences between off-policy and on-policy data distributions. The findings point to practical avenues for RL optimization, including subspace-constrained optimization and directed parameter-space exploration, and they open the door to second-order methods in high-dimensional RL settings. All code and data are publicly available for reproducibility.
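
The gradient subspace fraction defined above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the toy diagonal Hessian, the helper names `top_k_eigenvectors` and `gradient_subspace_fraction`, and the dense eigendecomposition (used here in place of the Lanczos estimates) are all assumptions made for illustration.

```python
import numpy as np


def top_k_eigenvectors(hessian, k):
    """Return the eigenvectors of the k largest eigenvalues as columns.

    Assumption: a dense eigendecomposition stands in for the Lanczos
    estimates mentioned in the text; it is only practical for tiny problems.
    """
    _, eigvecs = np.linalg.eigh(hessian)  # eigenvalues in ascending order
    return eigvecs[:, -k:]


def gradient_subspace_fraction(grad, basis):
    """Compute S_frac = ||P g||^2 / ||g||^2 for an orthonormal basis.

    With orthonormal columns in `basis`, the coordinates of the projected
    gradient are basis.T @ grad, and their squared norm equals ||P g||^2.
    """
    projected = basis.T @ grad
    return float(projected @ projected / (grad @ grad))


# Hypothetical toy problem: two high-curvature directions, two flat ones.
hessian = np.diag([100.0, 50.0, 0.1, 0.01])
grad = np.array([3.0, 0.0, 4.0, 0.0])

basis = top_k_eigenvectors(hessian, k=2)
s_frac = gradient_subspace_fraction(grad, basis)
print(f"S_frac = {s_frac:.2f}")
```

In this toy setting, only the first gradient component lies in the high-curvature subspace, so the fraction is $9/25 = 0.36$. A value near 1 would mean the gradient is almost entirely contained in the subspace.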

Abstract

Policy gradient methods hold great potential for solving complex continuous control tasks. Still, their training efficiency can be improved by exploiting structure within the optimization problem. Recent work indicates that supervised learning can be accelerated by leveraging the fact that gradients lie in a low-dimensional and slowly-changing subspace. In this paper, we conduct a thorough evaluation of this phenomenon for two popular deep policy gradient methods on various simulated benchmark tasks. Our results demonstrate the existence of such gradient subspaces despite the continuously changing data distribution inherent to reinforcement learning. These findings reveal promising directions for future work on more efficient reinforcement learning, e.g., through improving parameter-space exploration or enabling second-order optimization.

Paper Structure

This paper contains 25 sections, 9 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: The spectrum of the Hessian eigenvalues for PPO on the tasks Finger-spin and Walker2D, shown for the actor and critic losses. The Hessian is estimated from $10^6$ state-action pairs. For both the actor and the critic loss, there is a small number of large eigenvalues, while the bulk of the eigenvalues is close to zero. This finding shows that there is a small number of high-curvature directions in the loss landscapes, which is in accordance with results from supervised learning (SL).
  • Figure 2: The fraction $S_\mathrm{frac}$ of the gradient that lies within the high-curvature subspace spanned by the eigenvectors of the 100 largest Hessian eigenvalues. Results are displayed for the actor (top) and critic (bottom) of PPO and SAC on the Ant, Finger-spin, LunarLanderContinuous, and Walker2D tasks. The results demonstrate that a significant fraction of the gradient lies within the high-curvature subspace, but the extent to which the gradient is contained in the subspace depends on the algorithm, task, and training phase. For both algorithms, the gradient subspace fraction is significantly higher for the critic than for the actor. The fraction is also often larger for SAC's actor than for PPO's, particularly in the early stages of training. Even with mini-batch estimates of the gradient and Hessian, the gradient subspace fraction is considerable.
  • Figure 3: Evolution of the overlap between the high-curvature subspace identified at an early timestep $t_1=100{,}000$ and later timesteps for the actor and critic of PPO and SAC. While the overlap between the subspaces degrades as the networks are updated, it remains considerable even after $400{,}000$ timesteps, indicating that the subspace remains similar, even under significant changes in the network parameters and the data distribution. This finding implies that information about the gradient subspace at earlier timesteps can be reused at later timesteps.
  • Figure 4: Learning curves for PPO and SAC on tasks from OpenAI Gym (Brockman et al., 2016), Gym Robotics (Plappert et al., 2018), and the DeepMind Control Suite (Tunyasuvunakool et al., 2020). We use the algorithm implementations of Stable Baselines3 (Raffin et al., 2021) with tuned hyperparameters from RL Baselines3 Zoo (Raffin, 2020) for the Gym tasks and hyperparameters tuned by random search over 50 configurations for the Gym Robotics and DeepMind Control Suite tasks. Results are averaged over ten random seeds; shaded areas represent the standard deviation across seeds.
  • Figure 5: The evolution of the fraction of the gradient that lies within the high-curvature subspace throughout training for PPO on all tasks, evaluated for gradient subspaces with different numbers of eigenvectors. Results are shown for the actor and critic.
  • ...and 7 more figures
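
The subspace overlap tracked in Figure 3 can be sketched in NumPy. The captions above do not spell out the formula, so this sketch assumes one common definition: the mean squared norm of the later subspace's basis vectors after projection onto the earlier subspace. The function name `subspace_overlap` and the toy bases are hypothetical and chosen only for illustration.

```python
import numpy as np


def subspace_overlap(basis_early, basis_late):
    """Overlap in [0, 1] between two k-dimensional subspaces.

    Assumed definition (not taken from the captions): the average over the
    late basis vectors v_j of ||P_early v_j||^2, where P_early projects onto
    the early subspace. Both inputs must have orthonormal columns.
    """
    k = basis_late.shape[1]
    # Entry (i, j) is the inner product of early vector i and late vector j.
    inner_products = basis_early.T @ basis_late
    return float(np.sum(inner_products ** 2) / k)


# Toy bases built from standard unit vectors in a 4-dimensional space.
unit = np.eye(4)
early = unit[:, [0, 1]]

overlap_same = subspace_overlap(early, unit[:, [1, 0]])      # same subspace
overlap_half = subspace_overlap(early, unit[:, [0, 2]])      # one shared direction
overlap_disjoint = subspace_overlap(early, unit[:, [2, 3]])  # orthogonal subspaces

print(overlap_same, overlap_half, overlap_disjoint)
```

Under this assumed definition, identical subspaces give an overlap of 1 regardless of how the basis vectors are ordered, orthogonal subspaces give 0, and sharing one of two directions gives 0.5.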