Identifying Policy Gradient Subspaces
Jan Schneider, Pierre Schumacher, Simon Guist, Le Chen, Daniel Häufle, Bernhard Schölkopf, Dieter Büchler
TL;DR
This paper investigates gradient subspaces in deep policy gradient RL. It demonstrates that, despite the nonstationary data produced by exploration, the gradients of PPO and SAC lie largely in a low-dimensional, high-curvature subspace spanned by the top Hessian eigenvectors. Using Lanczos-based estimates of the Hessian's top eigenvectors, computed on large batches of on-policy or replay-buffer data, the authors quantify the gradient subspace fraction $S_\mathrm{frac} = \frac{\|P g\|^2}{\|g\|^2}$, where $g$ is the gradient and $P$ denotes the projection onto that subspace, and show that a small number of directions dominate the curvature. The subspaces also remain relatively stable over training, as measured by the subspace overlap $S_\mathrm{overlap}$. The critic subspace is generally more stable and informative than the actor subspace, and SAC exhibits stronger subspace alignment than PPO, reflecting the difference between off-policy and on-policy data distributions. The findings point to practical avenues for RL optimization, including subspace-constrained optimization and directed parameter-space exploration, and they suggest that second-order methods may become feasible in high-dimensional RL settings. All code and data are publicly available for reproducibility.
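To make the two metrics concrete, here is a minimal sketch in plain Python. It assumes the subspace is given as a list of orthonormal vectors (for example, top Hessian eigenvectors), so that $\|P g\|^2$ is the sum of squared inner products between the gradient and those vectors. The overlap formula used here, the mean squared projection of one basis onto the other, is an assumption about how $S_\mathrm{overlap}$ might be defined, since the summary above does not spell it out. The function names and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def subspace_fraction(basis, g):
    """Fraction of the squared gradient norm that lies in span(basis).

    Assumes `basis` is a list of orthonormal vectors, so the squared norm
    of the projection is the sum of squared inner products with `g`.
    """
    g_sq = dot(g, g)
    if g_sq == 0:
        raise ValueError("gradient must be non-zero")
    return sum(dot(v, g) ** 2 for v in basis) / g_sq


def subspace_overlap(basis_a, basis_b):
    """Mean squared projection of the vectors in basis_b onto span(basis_a).

    Assumed definition (not taken from the summary): 1.0 means the two
    subspaces coincide, 0.0 means they are orthogonal.
    """
    total = sum(dot(u, v) ** 2 for v in basis_b for u in basis_a)
    return total / len(basis_b)


# Toy 4-dimensional example with hand-picked orthonormal bases.
e1, e2, e3 = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]
basis_a = [e1, e2]   # hypothetical subspace at an earlier training step
basis_b = [e1, e3]   # hypothetical subspace at a later training step
g = [3, 0, 4, 0]     # hypothetical gradient

frac = subspace_fraction(basis_a, g)
overlap = subspace_overlap(basis_a, basis_b)
print(frac, overlap)
```

In this toy case only the first gradient component lies in the subspace spanned by `basis_a`, so the fraction is 9/25, and the two bases share one of their two directions, so the overlap is 1/2. With real networks the basis vectors would come from an eigensolver and the arithmetic would typically be vectorized with an array library.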
Abstract
Policy gradient methods hold great potential for solving complex continuous control tasks. Still, their training efficiency can be improved by exploiting structure within the optimization problem. Recent work indicates that supervised learning can be accelerated by leveraging the fact that gradients lie in a low-dimensional and slowly-changing subspace. In this paper, we conduct a thorough evaluation of this phenomenon for two popular deep policy gradient methods on various simulated benchmark tasks. Our results demonstrate the existence of such gradient subspaces despite the continuously changing data distribution inherent to reinforcement learning. These findings reveal promising directions for future work on more efficient reinforcement learning, e.g., through improving parameter-space exploration or enabling second-order optimization.
